# Test idefics2 on the RBNR dataset

This is a ready-to-use notebook for testing idefics2 on the RBNR dataset. Just run the cells in order and see how it goes: the notebook also takes care of downloading the dataset and the model. You can edit the parameters and the dataset path if you want to test idefics2 on a different dataset.

## SECTION 1 - parameters and dependencies

*First* of all, we need to install and import the dependencies:

**File and Dataset**

* **os** – filesystem operations
* **gdown** – download dataset from Google Drive
* **zipfile / tarfile** – extract compressed files
* **pathlib** – manage file paths

**Model and Inference**

* **torch / torchvision** – deep learning framework and image transformations
* **transformers** – load model, processor, tokenizer
* **bitsandbytes** – 4-bit quantization support
* **peft** – parameter-efficient fine-tuning
* **PIL (Pillow)** – image loading and processing
* **io.BytesIO** – handle image streams
* **gc** – garbage collection for memory management
* **time** – manage delays for safe memory clearance

**Image Preprocessing**

* **torchvision.transforms** – image transformations
* **torchvision.transforms.functional** – interpolation methods
* **transformers.image_utils.load_image** – load images from URLs or files

**Text and Regex**

* **re** – extract digits or patterns from model outputs
* **json** – read/write configuration files (if needed)

**Evaluation**

* **scikit-learn (metrics)** – compute precision, recall, F1
* **matplotlib.pyplot** – plot results, confusion matrices


In [1]:
!pip install -U bitsandbytes accelerate transformers safetensors
!pip install -U git+https://github.com/huggingface/transformers.git
!pip install -U git+https://github.com/huggingface/peft.git

Collecting bitsandbytes
  Downloading bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Downloading bitsandbytes-0.49.0-py3-none-manylinux_2_24_x86_64.whl (59.1 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.49.0
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-f40551l7
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-f40551l7
  Resolved https://github.com/huggingface/transformers.git to commit 0a8465420eecbac1c6d7dd9f45c08dd96b8c5027
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing meta

In [17]:
import requests
import torch
from PIL import Image
from io import BytesIO

import tarfile
import pathlib
import os
import zipfile
import gdown
import re

import gc
import time

from transformers import AutoProcessor, AutoModelForVision2Seq, BitsAndBytesConfig

from transformers.image_utils import load_image




Now let's define some **variables** that are used throughout the notebook. The path of the **Drive dataset**, the **prompt**, and **max_new_tokens** can be modified here.

In [38]:
# PATH AND DATASET

dataset_link = 'https://drive.google.com/uc?id=12W-bY7SuctltDqHhl-OkI1DzwrcGnvJM'
dataset_extract_path = "./"            # folder where to extract the files
dataset_images_subfolder = 'cropped_RBNR_bib_dataset/images'  # subfolder with images
labels_path = 'cropped_RBNR_bib_dataset/labels.txt'   # path to the labels file

# MODEL PARAMETERS

model_name = "HuggingFaceM4/idefics2-8b"  # pretrained model
DEVICE = "cuda:0"                          # device to run the model (GPU)
max_new_tokens = 500                        # maximum number of generated tokens


# INFERENCE PARAMETERS
prompt_text = "What number do you see?"


# OUTPUT PARAMETERS

predictions_output_path = './predictions.txt'  # file to save the predictions


## SECTION 2 (optional) - download dataset

I am kindly hosting the dataset for you on my Google Drive, though I don't know for how long. To download it from there I use **gdown** to get the zip, then the **zipfile** library to extract it.

In [4]:
os.makedirs(dataset_extract_path, exist_ok=True)

# Function to download a file if it is not already present
def download_if_needed(filename, url):
    file_path = os.path.join(dataset_extract_path, filename)
    if not os.path.exists(file_path):
        print(f"📥 Downloading {filename} from Google Drive...")
        gdown.download(url, file_path, quiet=False)
    return file_path

# Download the files
x_dev_path_compressed = download_if_needed("./dataset.zip", dataset_link)


📥 Downloading ./dataset.zip from Google Drive...


Downloading...
From: https://drive.google.com/uc?id=12W-bY7SuctltDqHhl-OkI1DzwrcGnvJM
To: /content/dataset.zip
100%|██████████| 797k/797k [00:00<00:00, 146MB/s]


In [5]:
# Path to the downloaded file
compressed_file = x_dev_path_compressed

os.makedirs(dataset_extract_path, exist_ok=True)

# Extract everything (the downloaded archive is expected to be a zip file)
with zipfile.ZipFile(compressed_file, "r") as zip_ref:
    zip_ref.extractall(path=dataset_extract_path)

print(f"✅ Files extracted to: {dataset_extract_path}")


✅ Files extracted to: ./
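
The extraction cell assumes the download really is a zip archive. A minimal sketch of a guard that checks the format first is shown here; `extract_archive` is a hypothetical helper name (not part of the notebook), and the tar fallback is an assumption based on `tarfile` being imported above:

```python
import tarfile
import zipfile


def extract_archive(archive_path: str, target_dir: str) -> str:
    """Extract a zip or tar archive and return the detected format.

    Hypothetical helper for illustration; the notebook itself calls
    zipfile.ZipFile directly.
    """
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zip_ref:
            zip_ref.extractall(path=target_dir)
        return "zip"
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, "r:*") as tar_ref:
            tar_ref.extractall(path=target_dir)
        return "tar"
    # Neither format matched, e.g. the download did not return the archive
    raise ValueError(f"{archive_path} is neither a zip nor a tar archive")
```

With this in place, `extract_archive(compressed_file, dataset_extract_path)` could stand in for the `with zipfile.ZipFile(...)` block, failing with a readable message instead of a `BadZipFile` error.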


## SECTION 3 - load model and inference

In [14]:
processor = AutoProcessor.from_pretrained(model_name)
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto"
)




Loading weights:   0%|          | 0/763 [00:00<?, ?it/s]
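
Loading in 4-bit matters mainly for GPU memory. The following is a back-of-the-envelope sketch, assuming roughly 8 billion parameters (read off the model name, not measured) and counting the weights only. It ignores activations, the KV cache, quantization overhead, and any layers that stay in higher precision, so real usage will be higher:

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


approx_params = 8e9  # assumption: the "8b" in the model name

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gib(approx_params, bits):.1f} GiB")
```

Under these assumptions the estimate is about 14.9 GiB in float16 versus about 3.7 GiB in 4-bit, which is why quantization helps the model fit on a single consumer or free-tier GPU.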

In [8]:
def clear_memory():
    # Delete these variables if they exist in the current global scope
    for name in ('inputs', 'model', 'processor', 'trainer', 'peft_model', 'bnb_config'):
        globals().pop(name, None)
    time.sleep(2)

    # Garbage collection and clearing CUDA memory
    gc.collect()
    time.sleep(2)
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    time.sleep(2)
    gc.collect()
    time.sleep(2)

    print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")



In [9]:
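# Caution: this deletes `model` and `processor`. Run it only to free GPU memory
# before (re)loading the model, not between loading and inference.
clear_memory()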

GPU allocated memory: 0.00 GB
GPU reserved memory: 0.00 GB


In [29]:


images_path = pathlib.Path(dataset_images_subfolder)
predictions = []

# The chat messages are the same for every image, so build the prompt once
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": prompt_text},
        ],
    },
]

# Apply the chat template to get the formatted prompt string
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

for bib in sorted(os.listdir(images_path)):
    # load_image raises on unreadable input; record 'nan' to stay aligned with the labels
    try:
        img = load_image(str(images_path / bib))
    except (OSError, ValueError):
        predictions.append('nan')
        continue

    # Pass the prompt string to the processor along with the image
    inputs = processor(text=prompt, images=[img], return_tensors="pt")
    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # The decoded text also contains the prompt, so prompt_text should not contain digits
    response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

    # Keep the first standalone number in the response, or 'nan' if there is none
    numbers = re.findall(r'\b\d+\b', response)
    predictions.append(numbers[0] if numbers else 'nan')

    print(predictions[-1])





3638
3637
3719
3719
3531
3531
979
979
184
2583
2605
2605
2605
2605
2605
463
1463
2251
2078
2078
168
822
898
3648
869
599
311
31
1130
858
858
3369
3369
2874
2874
25
723
163
976
287
2103
400
755
2747
2925
2649
663
663
80
1777
177
2845
331
3645
2692
2692
3523
3527
3548
3614
3588
814
3035
1404
1679
941
941
435
435
3225
3708
2244
3628
3241
560
1478
3633
341
1676
3621
847
58
331
3638
3638
1183
168
847
3655
168
1676
48
690
478
1442
1442
1703
2908
3637
3276
61527
61074
71652
20431
10168
10910
1026
11245
11040
1103
10220
10246
2001
22216
21317
30793
31577
30513
35031
30416
30270
61539
1517
70433
80344
15657
5022
60452
31474
20927
60351
70322
90135
82007
61423
60511
80635
453
70511
10679
10165
10933
10190
1140
10
709988
11
10175
10191
10168
11142
10144
10227
10159
10171
10216
891
1234
10145
19078
10145
10780
11050
nan
10265
10
10971
81991
81991
11110
11187
10691
10330
11453
11453
11454
100
4345
4454
3778
4407
3855
3236
3482
775
775
3594
3855
3482
1518
37641
248
2254
3555
1251
4380
3482
3855
3594
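
The post-processing step in the loop keeps the first standalone run of digits found in the decoded text. Here is a small sketch of how that pattern behaves, using made-up response strings (not actual model outputs) and a hypothetical helper name `first_number`:

```python
import re


def first_number(text: str) -> str:
    """Return the first standalone run of digits in text, or 'nan'."""
    matches = re.findall(r'\b\d+\b', text)
    return matches[0] if matches else 'nan'


print(first_number("Assistant: The number is 3638."))     # 3638
print(first_number("Assistant: I cannot see a number."))  # nan
print(first_number("Assistant: bib A12"))                 # nan
print(first_number("Assistant: 1,234"))                   # 1
```

The last two cases show the limits of the pattern: digits glued to letters are skipped because of the `\b` word boundaries, and a thousands separator splits the number so only the leading part is kept.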

In [32]:
print(predictions)
# Save one prediction per line so the file can be compared with the labels
with open(predictions_output_path, 'w') as f:
    for line in predictions:
        f.write(line + "\n")

['3638', '3637', '3719', '3719', '3531', '3531', '979', '979', '184', '2583', '2605', '2605', '2605', '2605', '2605', '463', '1463', '2251', '2078', '2078', '168', '822', '898', '3648', '869', '599', '311', '31', '1130', '858', '858', '3369', '3369', '2874', '2874', '25', '723', '163', '976', '287', '2103', '400', '755', '2747', '2925', '2649', '663', '663', '80', '1777', '177', '2845', '331', '3645', '2692', '2692', '3523', '3527', '3548', '3614', '3588', '814', '3035', '1404', '1679', '941', '941', '435', '435', '3225', '3708', '2244', '3628', '3241', '560', '1478', '3633', '341', '1676', '3621', '847', '58', '331', '3638', '3638', '1183', '168', '847', '3655', '168', '1676', '48', '690', '478', '1442', '1442', '1703', '2908', '3637', '3276', '61527', '61074', '71652', '20431', '10168', '10910', '1026', '11245', '11040', '1103', '10220', '10246', '2001', '22216', '21317', '30793', '31577', '30513', '35031', '30416', '30270', '61539', '1517', '70433', '80344', '15657', '5022', '60452'
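
The evaluation functions pair labels and predictions line by line with `zip`, which stops at the shorter input without warning. A minimal sketch of a sanity check that could run before evaluating is given here; `check_alignment` is a hypothetical helper, and it assumes that `labels.txt` lists the labels in the same sorted filename order used by the inference loop:

```python
def check_alignment(labels: list[str], predictions: list[str]) -> None:
    """Raise if labels and predictions cannot be paired one-to-one."""
    if len(labels) != len(predictions):
        raise ValueError(
            f"{len(labels)} labels but {len(predictions)} predictions; "
            "zip() would silently drop the extra lines"
        )


check_alignment(["3638", "3637"], ["3638", "3637"])  # passes silently
```

A length check cannot detect a wrong ordering, so it is still worth spot-checking a few image/label pairs by hand.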

## SECTION 4 - evaluation

Now that we have our predictions, we evaluate the results in two ways:


*   **complete number**: a prediction counts as a True Positive only if the predicted number matches the label exactly
*   **by digit**: instead of comparing the full number, we compare the digits of each number position by position

In both cases a `nan` prediction (no number found in the model output) counts as a False Negative.
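
To make the difference concrete, here is a toy example with invented label/prediction pairs (not taken from RBNR). It follows the same counting rules as the `evaluate` and `evaluate_digit` functions defined in this section, in a compact form; the helper name `prf` is hypothetical:

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Invented (label, prediction) pairs for illustration only
pairs = [("3638", "3638"), ("1463", "463"), ("979", "nan")]

# Complete number: exact match is a TP, 'nan' is a FN, anything else is a FP
tp = sum(pred == label for label, pred in pairs)
fn = sum(pred == "nan" for _, pred in pairs)
fp = len(pairs) - tp - fn
print("full :", prf(tp, fp, fn))

# By digit: compare position by position
tp = fp = fn = 0
for label, pred in pairs:
    if pred == "nan":
        fn += 1
        continue
    for i in range(max(len(label), len(pred))):
        if i >= len(pred):
            fn += 1  # digit in the label that was not predicted
        elif i >= len(label) or label[i] != pred[i]:
            fp += 1  # extra or wrong digit
        else:
            tp += 1
print("digit:", prf(tp, fp, fn))
```

Note how the dropped leading digit in `1463` vs `463` shifts every position, so the digit-level comparison scores it as three wrong digits and one missing digit. Digit-level scores are therefore sensitive to alignment, not only to how many digits were read correctly.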



In [30]:

def evaluate(labels_path: str, predictions_path: str) -> tuple[float, float, float]:
    """Full-number evaluation: return (precision, recall, F1).

    Exact match -> TP, 'nan' prediction -> FN, any other mismatch -> FP.
    """
    with open(labels_path, 'r') as f:
        labels = [line.strip() for line in f]

    with open(predictions_path, 'r') as f:
        predictions = [line.strip() for line in f]

    TP = 0
    FP = 0
    FN = 0
    # Labels and predictions are paired line by line
    for label, prediction in zip(labels, predictions):
        if prediction == 'nan':
            FN += 1
        elif prediction == label:
            TP += 1
        else:
            FP += 1

    P = TP / (TP + FP) if (TP + FP) > 0 else 0.0
    R = TP / (TP + FN) if (TP + FN) > 0 else 0.0
    F1 = 2 * (R * P) / (R + P) if (R + P) > 0 else 0.0
    return P, R, F1


In [31]:

def evaluate_digit(labels_path: str, predictions_path: str) -> tuple[float, float, float]:
    """Digit-level evaluation: return (precision, recall, F1).

    Digits are compared position by position; a 'nan' prediction counts as one FN.
    """
    with open(labels_path, 'r') as f:
        labels = [line.strip() for line in f]

    with open(predictions_path, 'r') as f:
        predictions = [line.strip() for line in f]

    TP = 0
    FP = 0
    FN = 0

    for label, prediction in zip(labels, predictions):

        if prediction == 'nan':
            FN += 1
            continue

        max_len = max(len(label), len(prediction))

        for i in range(max_len):
            true_digit = label[i] if i < len(label) else None
            pred_digit = prediction[i] if i < len(prediction) else None

            if true_digit is not None and pred_digit is not None:  # both digits exist, so compare them
                if true_digit == pred_digit:
                    TP += 1  # right prediction -> TP
                else:
                    FP += 1  # wrong prediction -> FP
            elif true_digit is not None and pred_digit is None:  # a digit was not predicted -> FN
                FN += 1
            elif pred_digit is not None and true_digit is None:
                FP += 1  # a predicted digit that does not exist in the label -> FP

    P = TP / (TP + FP) if (TP + FP) > 0 else 0.0
    R = TP / (TP + FN) if (TP + FN) > 0 else 0.0
    F1 = 2 * (R * P) / (R + P) if (R + P) > 0 else 0.0
    return P, R, F1


In [39]:
P_digit, R_digit, F1_digit = evaluate_digit(labels_path, predictions_output_path)
P_full, R_full, F1_full = evaluate(labels_path, predictions_output_path)

print("===== 📊 IDEFICS2 RESULTS =====")
print("\nFull number evaluation:")
print(f"Precision:  {P_full*100:.2f}%")
print(f"Recall:     {R_full*100:.2f}%")
print(f"F1-score:   {F1_full*100:.2f}%")

print("\nDigit evaluation:")
print(f"Precision:  {P_digit*100:.2f}%")
print(f"Recall:     {R_digit*100:.2f}%")
print(f"F1-score:   {F1_digit*100:.2f}%")

print("\n=================================")


===== 📊 IDEFICS2 RESULTS =====

Full number evaluation:
Precision:  88.54%
Recall:     99.22%
F1-score:   93.58%

Digit evaluation:
Precision:  95.02%
Recall:     97.05%
F1-score:   96.02%

