### **Project Redefinition**

The project evolves into a dual pipeline that combines two distinct approaches to processing scientific documents from ArXiv: one focused on multimodal information retrieval for QA, the other on building and using a graph for advanced reasoning. The new structure is summarized below:

---

### **1. Overall Goal**
Create a Google Colab notebook that explores two distinct methods:
1. **Multimodal Approach for QA**: Retrieval of content (text and visual) from ArXiv documents, with quantitative evaluation of performance on a multiple-choice QA task.
2. **GraphRAG for Advanced Reasoning**: Building and using a knowledge graph from ArXiv documents to answer complex questions (multi-hop reasoning and aggregations), using Neo4j, Milvus, and LangGraph.

---


### **2. Description of the Two Modules**

#### **Module 1: Multimodal Approach for QA**
- **Focus**: Extract text chunks and figures from the PDFs, use them in retrieval pipelines (multimodal and text-only), and evaluate them quantitatively on a multiple-choice QA task.
- **Technologies**:
  - Multimodal models (e.g., ColQwen2).
  - Late interaction for text-only retrieval.
  - Evaluation metrics such as Precision@k, MRR, and downstream QA accuracy.
- **Output**:
  - Numerical results on the performance of the different approaches.
  - Comparative tables and charts.
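
The retrieval metrics named for Module 1 can be sketched in a few lines. This is a minimal illustration, not the notebook's evaluation code: the function names and the idea of passing one `(ranked_list, relevant_set)` pair per query are assumptions made for the example.

```python
from typing import Hashable, Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(ranked)[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Iterable[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Average reciprocal rank over (ranked_list, relevant_set) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)
```

In this dataset each query has a single gold image, so `relevant` would be a one-element set holding that query's `image_filename`.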

---

#### **Module 2: GraphRAG for Advanced Reasoning**
- **Focus**: Build a knowledge graph from the ArXiv documents (entity and relation extraction) to answer complex questions.
- **Pipeline**:
  1. **Entity and Relation Extraction**:
     - Use an LLM (e.g., GPT-4o-mini) to identify nodes and edges, with variants that add relation strength or descriptions.
  2. **Graph Population**:
     - Storage in a Neo4j database for later querying.
  3. **Graph Querying**:
     - Approaches:
       - Direct neighbors of entities.
       - Community detection and clustering.
       - Cypher query generation with an LLM.
  4. **Reasoning and Answering**:
     - Generation of answers conditioned on entity clusters.
     - Combination of local and global answers with estimated weights.
- **Technologies**:
  - **Neo4j** as the graph database.
  - **Milvus** as the vector database for indexing.
  - **LangGraph** for the automated pipeline.
- **Output**:
  - Interactive demo with example questions.
  - Visualizations of graph queries and answers.
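
To make the "direct neighbors" and multi-hop querying ideas concrete, here is a small in-memory sketch. It is a stand-in for the Neo4j graph, not the planned implementation: the triples are illustrative examples, and `build_graph`, `neighbors`, and `within_hops` are hypothetical helper names.

```python
from collections import defaultdict, deque


def build_graph(triples):
    """Build an undirected adjacency map from (head, relation, tail) triples."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
        graph[tail].append((relation, head))
    return graph


def neighbors(graph, entity):
    """Direct neighbors of an entity, sorted for stable output."""
    return sorted({other for _, other in graph.get(entity, [])})


def within_hops(graph, start, max_hops):
    """Breadth-first search: map each reachable entity to its hop distance."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for _, other in graph.get(node, []):
            if other not in distances:
                distances[other] = distances[node] + 1
                queue.append(other)
    distances.pop(start)
    return distances


# Illustrative triples only (in the pipeline they would come from the LLM extraction step)
triples = [
    ("ColQwen2", "extends", "ColPali"),
    ("ColPali", "uses", "late interaction"),
    ("ColBERT", "uses", "late interaction"),
]
graph = build_graph(triples)
```

A multi-hop question such as "which models share a technique with ColQwen2?" would then map to a bounded traversal like `within_hops(graph, "ColQwen2", 3)`.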

---

## Setup

In [None]:
!pip install datasets
!pip install pillow
!pip install matplotlib
!pip install transformers torch datasets byaldi
!pip install -q pdf2image
!pip install git+https://github.com/huggingface/transformers.git
!pip install qwen-vl-utils
!pip install flash-attn
!pip install byaldi

In [None]:
!pip install torch torchvision

In [None]:
!sudo apt-get install -y poppler-utils


In [None]:
# Import required libraries
import os

from datasets import load_dataset

import torch
import transformers
from byaldi import RAGMultiModalModel
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

import tqdm

import requests
import re

from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd


In [None]:
# Set device
device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"

print(f"Using device: {device}")

Using device: cuda


## Load Original Dataset

In [None]:
"""
@misc{li2024multimodal,
            title={Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models},
            author={Lei Li and Yuqi Wang and Runxin Xu and Peiyi Wang and Xiachong Feng and Lingpeng Kong and Qi Liu},
            year={2024},
            eprint={2403.00231},
            archivePrefix={arXiv},
            primaryClass={cs.CV}
        }
        """

# Load parquet files inside folder data
# ds = load_dataset('parquet', data_files='data/test-00000-of-00001.parquet')

# Load combined ds
ds = load_dataset('parquet', data_files='data/processed_images_combined.parquet')

Generating train split: 500 examples [00:00, 1262.40 examples/s]


In [None]:
ds['train'][0]

{'query': 'Based on the graph, what is the impact of correcting for fspec not equal to 1 on the surface density trend?',
 'image': {'bytes': b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01...  (raw JPEG bytes; output truncated)

## Download sources pdf

In [None]:
def download_arxiv_pdf(arxiv_id, save_path):
    url = f"https://arxiv.org/pdf/{arxiv_id}"
    response = requests.get(url, timeout=30)  # avoid hanging forever on a stalled connection
    if response.status_code == 200:
        with open(save_path, 'wb') as file:
            file.write(response.content)
        print(f"PDF downloaded successfully: {save_path}")
    else:
        raise Exception(f"Failed to download PDF. Status code: {response.status_code}")


In [None]:
save_path = "data/docs/"
os.makedirs(save_path, exist_ok=True)

errors = []
for item in ds['train']:
    arxiv_id = item['image_filename'].replace('images/', '').split('_')[0]
    pdf_path = f"{save_path}{arxiv_id}.pdf"  # flat file name, never contains '/'

    if not os.path.exists(pdf_path):
        try:
            download_arxiv_pdf(arxiv_id, pdf_path)
        except Exception:
            # Old-style IDs (archive name + 7 digits) need an "archive/number" URL.
            # Keep saving to the flat pdf_path: a '/' in the file name would point
            # to a subfolder that does not exist.
            try:
                match = re.match(r'([a-z\-]+)(\d{7})', arxiv_id)
                old_style_id = match.group(1) + '/' + match.group(2)
                download_arxiv_pdf(old_style_id, pdf_path)
            except Exception:
                errors.append(arxiv_id)
                print(f"Failed to download PDF: {arxiv_id}")

Failed to download PDF: quant-ph/9912091
Failed to download PDF: cond-mat/0603861
Failed to download PDF: physics/0603179
Failed to download PDF: cond-mat/0201239
Failed to download PDF: cond-mat/0010301
Failed to download PDF: astro-ph/0207226
Failed to download PDF: cs/0505008
Failed to download PDF: cond-mat/0404614
Failed to download PDF: cond-mat/0507316
Failed to download PDF: astro-ph/9911146
Failed to download PDF: astro-ph/0407096
Failed to download PDF: cond-mat/0312100
Failed to download PDF: cond-mat/0303467
Failed to download PDF: nucl-th/0408026
Failed to download PDF: 2304.04203
Failed to download PDF: astro-ph/0302390
Failed to download PDF: cond-mat/0011289
Failed to download PDF: cond-mat/0304485
Failed to download PDF: cond-mat/0306096
Failed to download PDF: astro-ph/0309681
Failed to download PDF: cond-mat/0610297
Failed to download PDF: cond-mat/0103207
Failed to download PDF: quant-ph/0306172
Failed to download PDF: astro-ph/0007066
Failed to download PDF: cond-m
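
Most identifiers in the failure log are old-style arXiv IDs of the form `archive/number`. One plausible cause (an assumption, not verified here) is that a save path containing `/` points to a subfolder that was never created. The sketch below separates the ID used in the URL from a flat name used on disk; `normalize_arxiv_id` is a hypothetical helper and the second example file name is illustrative.

```python
import re

# Old-style IDs appear in file names as archive name + 7 digits, e.g. "cond-mat0603861"
OLD_STYLE_ID = re.compile(r"^([a-z\-]+)(\d{7})$")


def normalize_arxiv_id(image_filename):
    """Return (id_for_url, flat_name_for_disk) for a dataset image file name."""
    flat = image_filename.replace("images/", "").split("_")[0]
    match = OLD_STYLE_ID.match(flat)
    if match:
        # The URL needs "archive/number"; the file on disk keeps the flat form
        return f"{match.group(1)}/{match.group(2)}", flat
    return flat, flat


url_id, disk_name = normalize_arxiv_id("images/cond-mat0603861_1.jpg")
```

With this split, the download URL would use `url_id` while the PDF is written to `data/docs/{disk_name}.pdf`.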

## Generation of images description

In [None]:
multimodal_model_name = "Qwen/Qwen2-VL-2B-Instruct"
multimodal_model = Qwen2VLForConditionalGeneration.from_pretrained(
                                                        multimodal_model_name,
                                                        trust_remote_code=True,
                                                        torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained(multimodal_model_name, trust_remote_code=True)

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  6.32it/s]


In [None]:
def create_messages(img):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": img,
                    "max_pixels": 720**2,
                },
                {
                    "type": "text",
                    "text":
                        "Based on the image, provide a detailed scientific description of the graph."
                },
            ],
        }
    ]

    return messages


In [None]:
def invoke_generation(messages):

    print('Applying vision template...')
    # Apply a chat template to the messages without tokenizing and add a generation prompt
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    print('Extracting vision inputs...')
    # Process vision information from the messages to get image and video inputs
    image_inputs, video_inputs = process_vision_info(messages)

    print('Preparing inputs...')
    # Prepare the inputs for the model by combining text, images, and videos
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",  # Return the inputs as PyTorch tensors
    )

    print('Moving inputs to device...')
    # Move the inputs to the specified device (e.g., GPU)
    inputs = inputs.to(device)

    print('Generating output...')
    # Generate output IDs from the model with a maximum of 500 new tokens
    generated_ids = multimodal_model.generate(**inputs, max_new_tokens=500)

    print('Trimming prompt tokens...')
    # Trim the generated IDs to remove the input IDs from the beginning
    generated_ids_trimmed = [
        out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    print('Decoding text...')
    # Decode the trimmed generated IDs to get the output text
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    print('Output:', output_text)
    print('\n\n ---------------------------------------- \n\n')
    return output_text

In [None]:
# Function to process a single example
def process_item(example):
    try:
        img = example['image']  # Extract the image
        messages = create_messages(img)  # Create messages
        output_text = invoke_generation(messages)  # Generate output text
        return {"image_filename": example["image_filename"], "output_text": output_text}
    except Exception as e:
        print(f"Error processing item: {e}")

# Function to save processed data as a checkpoint
def save_checkpoint(processed_data, checkpoint_file):
    df = pd.DataFrame(processed_data)
    df.to_parquet(checkpoint_file, index=False)
    print(f"Checkpoint saved: {len(processed_data)} items to {checkpoint_file}")

# Process dataset in batches with checkpoints
def process_dataset_with_checkpoints(ds, batch_size=10, output_file="data/processed_images.parquet"):
    train_data = ds['train']  # Access training data
    processed_items = []  # To store processed items
    checkpoint_file = output_file

    for idx, example in enumerate(tqdm.tqdm(train_data)):
        processed_item = process_item(example)
        # process_item returns None when generation fails; skip those rows
        if processed_item is not None:
            processed_items.append(processed_item)

        # Save every batch_size items or at the end
        if (idx + 1) % batch_size == 0 or (idx + 1) == len(train_data):
            save_checkpoint(processed_items, checkpoint_file)

    return processed_items

# Run the processing
processed_dataset = process_dataset_with_checkpoints(
    ds, batch_size=10, output_file="data/processed_images.parquet"
)
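
The checkpoint file is rewritten every `batch_size` items, but a restarted run would still begin from the first example. A minimal sketch of resuming, assuming the checkpoint's `image_filename` column is read back first; `pending_examples` is a hypothetical helper, not part of the notebook.

```python
def pending_examples(examples, processed_filenames):
    """Return the examples whose image_filename is not in the checkpoint yet."""
    done = set(processed_filenames)
    return [ex for ex in examples if ex["image_filename"] not in done]


# Toy data standing in for ds['train'] and for the checkpoint contents
examples = [
    {"image_filename": "images/a.jpg"},
    {"image_filename": "images/b.jpg"},
    {"image_filename": "images/c.jpg"},
]
todo = pending_examples(examples, ["images/a.jpg", "images/c.jpg"])
```

In the notebook, `processed_filenames` could come from `pd.read_parquet(checkpoint_file)["image_filename"]` when the checkpoint exists, with new results appended to the rows already stored there.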

In [None]:
proc_img = load_dataset('parquet', data_files='data/processed_images.parquet')['train']

Generating train split: 500 examples [00:00, 128140.78 examples/s]


In [None]:
# Join the processed data with the original dataset
df_1 = ds['train'].to_pandas()
df_2 = proc_img.to_pandas()
df = pd.merge(df_1, df_2, on="image_filename")

df.to_parquet("data/processed_images_combined.parquet", index=False)

## 1. Multimodal Representational Models with Late Interaction

#### Create visual embedding colqwen2-v0.1

In [None]:
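# Save all images
PATH = 'data/images'
os.makedirs(PATH, exist_ok=True)  # do not fail when the folder already exists

for item in ds['train']:
    # image_filename already starts with "images/", so this writes into data/images/
    item['image'].save(f'data/{item["image_filename"]}')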

In [None]:
RAG = RAGMultiModalModel.from_pretrained("./models/colqwen2-v0.1-merged", device=device)

Verbosity is set to 1 (active). Pass verbose=0 to make quieter.


In [None]:
try:
    RAG.index(
        input_path="data/images/",  # The path to your documents
        index_name='colqwen2-v0.1-merged-arxiv_qa',  # The name you want to give to your index. It'll be saved at `index_root/index_name/`.
        store_collection_with_index=False,  # Whether the index should store the base64 encoded documents.
        overwrite=False,  # Whether to overwrite an index if it already exists. If False, it'll return None and do nothing if `index_root/index_name` exists.
    )
except ValueError:
    RAG = RAG.from_index('colqwen2-v0.1-merged-arxiv_qa', device='cpu')

Verbosity is set to 1 (active). Pass verbose=0 to make quieter.



In [None]:
text_query = 'Based on the graph, what is the impact of correcting for fspec not equal to 1 on the surface density trend?'
results = RAG.search(text_query, k=1)
results

[{'doc_id': 245, 'page_num': 1, 'score': 24.875, 'metadata': {}, 'base64': None}]
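
The `score` above comes from late interaction: every query token embedding is compared with every document (or image patch) embedding, the best match per query token is kept, and the maxima are summed (the MaxSim operator used by ColBERT-style models). The sketch below shows that operator on toy vectors in pure Python. It illustrates the scoring rule only and does not reproduce byaldi's batched implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, doc_vectors):
    """Sum over query tokens of the best dot product with any document vector."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Toy 2-dimensional embeddings (illustrative values)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [1.0, 0.0]]

# Rank documents by descending late-interaction score
ranking = sorted(
    {"doc_a": doc_a, "doc_b": doc_b}.items(),
    key=lambda kv: maxsim_score(query, kv[1]),
    reverse=True,
)
```

Because each query token picks its own best match, a document scores highly only when it covers all parts of the query, which is why the raw scores are unnormalized sums rather than cosine similarities.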

In [None]:
image_name='/'.join(RAG.get_doc_ids_to_file_names()[results[0]['doc_id']].split('/')[-2:])

In [None]:
# Find image_name in ds
for i, item in enumerate(ds['train']):
    if item['image_filename']==image_name:
        break

#### Create textual embedding with RAGatouille (colbert-ir/colbertv2.0)

In [None]:
!pip install ragatouille

Collecting ragatouille
  Downloading ragatouille-0.0.8.post4-py3-none-any.whl.metadata (15 kB)
Collecting colbert-ai==0.2.19 (from ragatouille)
  Downloading colbert-ai-0.2.19.tar.gz (86 kB)
  Preparing metadata (setup.py) ... done
Collecting faiss-cpu<2.0.0,>=1.7.4 (from ragatouille)
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting fast-pytorch-kmeans==0.2.0.1 (from ragatouille)
  Downloading fast_pytorch_kmeans-0.2.0.1-py3-none-any.whl.metadata (1.1 kB)
Collecting llama-index>=0.7 (from ragatouille)
  Downloading llama_index-0.12.5-py3-none-any.whl.metadata (11 kB)
Collecting onnx<2.0.0,>=1.15.0 (from ragatouille)
  Downloading onnx-1.17.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting sentence-transformers<3.0.0,>=2.2.2 (from ragatoui

In [None]:
from datasets import load_dataset

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
ds=load_dataset('parquet', data_files='/content/drive/MyDrive/big data/processed_images.parquet')
output_text=ds['train'].to_pandas()['output_text']
output_text=output_text.apply(lambda x: x[0])
output_text=output_text.values.tolist()


Generating train split: 0 examples [00:00, ? examples/s]

In [None]:
len(output_text)

500

In [None]:
!pip uninstall -y faiss-cpu && pip install faiss-gpu

Found existing installation: faiss-cpu 1.9.0.post1
Uninstalling faiss-cpu-1.9.0.post1:
  Successfully uninstalled faiss-cpu-1.9.0.post1
Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [None]:
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
index_path = RAG.index(index_name="my_index", collection=output_text, use_faiss=True)

[Dec 15, 01:53:18] #> Creating directory .ragatouille/colbert/indexes/my_index 
[Dec 15, 01:53:19] [0] 		 #> Encoding 1251 passages..
[Dec 15, 01:53:29] [0] 		 avg_doclen_est = 150.8784942626953 	 len(local_sample) = 1,251
[Dec 15, 01:53:29] [0] 		 Creating 4,096 partitions.
[Dec 15, 01:53:29] [0] 		 *Estimated* 188,748 embeddings.
[Dec 15, 01:53:29] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/my_index/plan.json ..
[Dec 15, 01:53:32] Loading decompress_residuals_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Dec 15, 01:55:09] Loading packbits_cpp extension (set COLBERT_LOAD_TORCH_EXTENSION_VERBOSE=True for more info)...
[Dec 15, 01:56:45] [0] 		 #> Encoding 1251 passages..
[Dec 15, 01:56:50] #> Optimizing IVF to store map from centroids to list of pids..
[Dec 15, 01:56:50] #> Building the emb2pid mapping..
[Dec 15, 01:56:50] len(emb2pid) = 188749
[Dec 15, 01:56:50] #> Saved optimized IVF to .ragatouille/colbert/indexes/my_index/ivf.pid.pt
Done indexing!

In [None]:
from ragatouille import RAGPretrainedModel

query = "Based on the graph, what is the impact of correcting for fspec not equal to 1 on the surface density trend?"
RAG = RAGPretrainedModel.from_index("/content/.ragatouille/colbert/indexes/my_index")
results = RAG.search(query, k=3)

Loading searcher for index my_index for the first time... This may take a few seconds
[Dec 15, 02:02:02] #> Loading codec...
[Dec 15, 02:02:02] #> Loading IVF...
[Dec 15, 02:02:02] #> Loading doclens...
[Dec 15, 02:02:02] #> Loading codes and residuals...
Searcher loaded!

#> QueryTokenizer.tensorize(batch_text[0], batch_background[0], bsize) ==
#> Input: . Based on the graph, what is the impact of correcting for fspec not equal to 1 on the surface density trend?, 		 True, 		 None
#> Output IDs: torch.Size([32]), tensor([  101,     1,  2241,  2006,  1996, 10629,  1010,  2054,  2003,  1996,
         4254,  1997,  6149,  2075,  2005,  1042, 13102,  8586,  2025,  5020,
         2000,  1015,  2006,  1996,  3302,  4304,  9874,  1029,   102,   103,
          103,   103], device='cuda:0')
#> Output Mask: torch.Size([32]), tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 0, 0, 0], device='cuda:0')


In [None]:
results

[{'content': '3. **Dotted Blue Line (f_spec ≠ 1, with corr.)**: This line represents the surface density when \\(f_{\\text{spec}}\\) is not equal to 1, with a correction applied. This line has a slightly different slope compared to the dashed blue line, indicating that the correction has a small effect on the surface density.\n\nThe x-axis represents the radius in units of \\(h^{-1} \\, \\text{Mpc}\\), and the y-axis represents the surface density in units of \\(\\log \\Sigma [h^{-2} \\, \\text{Mpc}^{-2}]\\).\n\nThe graph shows that the surface density decreases as the radius increases, and the correction applied to the surface density has a small effect on the overall trend.',
  'score': 20.375,
  'rank': 1,
  'document_id': '64160ad6-a5be-42dc-8973-62c6a20a5f0d',
  'passage_id': 1},
 {'content': 'The graph in the image is a plot of the surface density of a hypothetical object as a function of its radius. The surface density is measured in units of \\(\\log \\Sigma [h^{-2} \\, \\text{

#### RAG Pipeline Qwen

In [None]:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model_name = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_name,
                                                        trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)

The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  4.77it/s]


In [None]:
item=ds['train'][0]

In [None]:
item

{'query': 'Based on the graph, what is the impact of correcting for fspec not equal to 1 on the surface density trend?',
 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1621x1191>,
 'image_filename': 'images/1810.10511_2.jpg',
 'options': "['A. Correction causes a significant increase in surface density across all radii.', 'B. Correction results in a decrease in surface density for larger radii.', 'C. Correction causes the surface density to converge with the fspec = 1 case at larger radii.', 'D. Correction does not affect the surface density trend at all.', '-']",
 'answer': 'C',
 'page': '',
 'model': 'gpt4V',
 'prompt': '',
 'source': 'arxiv_qa'}

In [None]:
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", trust_remote_code=True)

# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": item['image'],
#                 "max_pixels": 720**2,
#             },
#             {
#                 "type": "text",
#                 "text":
#                     item['query'] +
#                     "\n  Choose the correct answer from the options below: \n" +
#                     item['options'] +
#                     "Answer with the letter of the correct option."

#             },
#         ],
#     }
# ]

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": item['image'],
                "max_pixels": 720**2,
            },
            {
                "type": "text",
                "text":
                    "Based on the image, provide a detailed scientific description of the graph."

            },
        ],
    }
]
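
For the multiple-choice task itself (the commented-out prompt above), the `options` field is stored as the string form of a Python list, so it has to be parsed before it is shown to the model. The helpers below are a sketch under assumptions: the function names are hypothetical, dropping the `'-'` placeholder option is a guess about its meaning, and the letter extraction is a simple heuristic that can misfire on free-form answers.

```python
import ast
import re


def build_qa_prompt(query, options):
    """Build a multiple-choice prompt; options may be a list or its string form."""
    if isinstance(options, str):
        options = ast.literal_eval(options)  # safe parsing of the stringified list
    # Assumption: '-' and empty entries are placeholders, not real choices
    choices = [opt for opt in options if opt.strip() not in {"", "-"}]
    return (
        f"{query}\n"
        "Choose the correct answer from the options below:\n"
        + "\n".join(choices)
        + "\nAnswer with the letter of the correct option."
    )


def extract_choice(generated_text, valid_letters="ABCD"):
    """Return the first standalone option letter in the model output, or None."""
    match = re.search(rf"\b([{valid_letters}])\b", generated_text.strip())
    return match.group(1) if match else None


def accuracy(predictions, answers):
    """Share of predictions equal to the gold answer letters."""
    pairs = list(zip(predictions, answers))
    return sum(p == a for p, a in pairs) / len(pairs) if pairs else 0.0


prompt = build_qa_prompt("Q?", "['A. up', 'B. down', '-']")
```

The resulting string would replace the description prompt in the `"text"` entry of `messages`, and `extract_choice` would be applied to `output_text[0]` before comparing with `item['answer']`.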

In [None]:
print(messages)

[{'role': 'user', 'content': [{'type': 'image', 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=1621x1191 at 0x3B61D7C90>, 'max_pixels': 518400}, {'type': 'text', 'text': 'Based on the image, provide a detailed scientific description of the graph.'}]}]


In [None]:
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

In [None]:
print(text)

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Based on the image, provide a detailed scientific description of the graph.<|im_end|>
<|im_start|>assistant



In [None]:
image_inputs, video_inputs = process_vision_info(messages)

In [None]:
image_inputs

[<PIL.Image.Image image mode=RGB size=812x616>]

In [None]:
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(device)

In [None]:
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)


In [None]:
print(output_text[0])

The graph in the image is a plot of the surface density of a hypothetical object as a function of its radius. The x-axis represents the radius of the object in units of h^{-1} Mpc, while the y-axis represents the surface density in units of h^2 Mpc^{-2}. The surface density is calculated using the formula:

\[ \Sigma = \frac{1}{4\pi R^2} \]

where \( R \) is the radius of the object.

There are three lines on the graph:

1. **Black line (f_spec = 1)**: This line represents the surface density when the factor \( f_{\text{spec}} \) is equal to 1. This line is a straight line with a slope of -1, indicating that the surface density decreases as the radius increases.

2. **Blue dashed line (f_spec ≠ 1, w/o corr.)**: This line represents the surface density when the factor \( f_{\text{spec}} \) is not equal to 1, but without any correction. This line is a curved line with a slope that is not -1, indicating that the surface density decreases more steeply as the radius increases.

3. **Blue so

#### RAG pipeline Gemma

In [None]:
# model_id = "google/gemma-2-2b-it"
model_id = "models/gemma-2-2b-it"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device=device
)

Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  4.64it/s]


In [None]:
messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]

outputs = pipeline(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)

Ahoy, matey! I be Gemma, a digital scallywag, a language-slingin' parrot of the digital seas.  I be here to help ye with yer wordy woes, answer yer questions, and spin ye yarns of the digital world.  So, what be yer pleasure, eh? 🦜


#### **Part 1: Multimodal Retrieval**


3. **Implementation of the Approaches**:
   - Multimodal with late-interaction vision-language models (e.g., ColQwen2).
   - Text-only with late interaction and late chunking.

4. **QA Pipeline**: Multiple-choice answering with a generative model.

5. **Evaluation**:
   - Retrieval metrics.
   - QA accuracy.

6. **Visualization**: Results and comparative charts.

#### **Part 2: GraphRAG**
1. **Setup**: Library installation (Neo4j, Milvus, LangGraph).
2. **Triple Extraction**:
   - LLM for nodes and edges.
   - Choice between adding relation strength or descriptions.
3. **Graph Construction**:
   - Insertion into Neo4j.
4. **Querying and Reasoning**:
   - Community detection, direct neighbors, Cypher queries.
   - Generation of local and global answers.
5. **Interactive Demo**:
   - Multi-hop questions with structured answers.
6. **Visualization**:
   - Visualization of the graph and of the answers.
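
As a sketch of the graph-construction step, an extracted triple can be turned into a parameterized Cypher `MERGE` statement. The node label `Entity`, the relationship type `RELATED_TO`, and the property names are assumptions chosen for illustration; the helper only builds the query text and parameters and does not connect to a database.

```python
def triple_to_cypher(head, relation, tail, strength=None, description=None):
    """Return (query, parameters) that upsert one triple as two nodes and one edge."""
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        "MERGE (h)-[r:RELATED_TO {type: $relation}]->(t)"
    )
    params = {"head": head, "tail": tail, "relation": relation}

    # Optional edge attributes, matching the two extraction variants in the plan
    updates = []
    if strength is not None:
        updates.append("r.strength = $strength")
        params["strength"] = float(strength)
    if description is not None:
        updates.append("r.description = $description")
        params["description"] = description
    if updates:
        query += " SET " + ", ".join(updates)
    return query, params


query, params = triple_to_cypher("ColPali", "uses", "late interaction", strength=0.9)
```

Passing values as parameters rather than formatting them into the query string avoids quoting problems with entity names produced by the LLM. With the official Neo4j Python driver, the pair would typically be handed to `session.run(query, params)`.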

---


### **4. Final Output**
- **Module 1**: Quantitative report on the performance of the retrieval models on the multimodal task.
- **Module 2**: Interactive knowledge graph with answers to complex questions that require multi-hop reasoning.