# Test SAIL-VL on the RBNR dataset

This is a ready-to-use notebook for running SAIL-VL on the RBNR dataset. Run the cells in order: the notebook downloads both the dataset and the model, then runs inference and evaluation. To test SAIL-VL on a different dataset, edit the parameters and the dataset path defined in Section 1.

## SECTION 1 - parameters and dependencies

# SAIL-VL Setup Guide

*First* of all, we need to install and import the dependencies:

**File and Dataset**

* **os** – filesystem operations
* **gdown** – download dataset from Google Drive
* **zipfile / tarfile** – extract compressed files
* **pathlib** – manage file paths
* **PIL (Pillow)** – image loading and processing
* **numpy** – numerical operations

**Model and Inference**

* **torch / torchvision** – deep learning framework and image transformations
* **transformers** – load SAIL-VL model and tokenizer
* **AutoModel / AutoTokenizer** – load pretrained model and tokenizer

**Image Preprocessing**

* **torchvision.transforms** – build image transforms
* **torchvision.transforms.functional** – interpolation methods

**Text and Regex**

* **re** – extract numbers or patterns from output

**Evaluation**

* **scikit-learn (metrics)** – compute precision, recall, F1, confusion matrices
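Before running the heavier cells, it can help to confirm that the third-party packages listed above are importable. The following is a minimal sketch, not part of the original notebook; the `REQUIRED` list and the `missing_packages` helper are illustrative names, and the list only covers the packages named above.

```python
from importlib.util import find_spec

# Import names for the packages listed above ("sklearn" is the import name of scikit-learn).
REQUIRED = ["gdown", "PIL", "numpy", "torch", "torchvision", "transformers", "sklearn"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, the `pip install` cells below should take care of it.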




In [None]:
!pip install flash_attn

In [None]:
!pip3 install einops transformers timm
!pip install -q transformers==4.44.2 torch torchvision pillow accelerate


In [1]:
import os
import gdown
import tarfile

import zipfile
import pathlib

import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer, AutoProcessor

import re

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, precision_score, recall_score, f1_score





Now let's define some **variables** that are used throughout the notebook. The **Drive dataset** link, the **prompt**, and **max_tokens** can all be modified here.

In [None]:
dataset_link = 'https://drive.google.com/uc?id=12W-bY7SuctltDqHhl-OkI1DzwrcGnvJM'
dataset_folder = './'  # local folder to save the dataset
dataset_extract_path = './'  # folder to extract files
dataset_images_subfolder = 'cropped_RBNR_bib_dataset/images'  # subfolder with images
labels_path = 'cropped_RBNR_bib_dataset/labels.txt'  # path to the labels file

# MODEL PARAMETERS
model_path = "BytedanceDouyinContent/SAIL-VL-1d5-2B"
dtype = torch.bfloat16  # tensor data type

# INFERENCE PARAMETERS
input_image_size = 448  # input size for the model
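max_num_tiles = 10  # maximum number of tiles per image (not read by the cells below)
max_images = 10  # passed to load_image as max_num, i.e. the tile limit actually used at inference
max_tokens = 1024  # maximum number of new tokens to generate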
prompt = "What number do you see?"

use_thumbnail = True  # whether to also use the thumbnail version for prediction

# EVALUATION PARAMETERS
predictions_output_path = './predictions.txt'
max_digit_length = 4  # maximum number of digits considered

# IMAGE NORMALIZATION
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
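To make the normalization constants above more concrete, here is a small pure-Python sketch of what `T.ToTensor()` followed by `T.Normalize(mean, std)` does to a single 8-bit RGB pixel: scale to [0, 1], then subtract the channel mean and divide by the channel standard deviation. The `normalize_pixel` helper is a hypothetical name used only for illustration.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )

# Darkest and brightest possible pixels after normalization.
black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print([round(c, 3) for c in black])
print([round(c, 3) for c in white])
```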


## SECTION 2 (optional) - download dataset

The dataset is hosted on my Google Drive, though I cannot say for how long. The cells below use **gdown** to download the zip archive and the **zipfile** library to extract it.

In [3]:
os.makedirs(dataset_extract_path, exist_ok=True)

# Function to download a file if it is not already present
def download_if_needed(filename, url):
    file_path = os.path.join(dataset_extract_path, filename)
    if not os.path.exists(file_path):
        print(f"📥 Downloading {filename} from Google Drive...")
        gdown.download(url, file_path, quiet=False)
    return file_path

# Download the files
x_dev_path_compressed = download_if_needed("./dataset.zip", dataset_link)


In [4]:
# Path to the downloaded file
compressed_file = x_dev_path_compressed

os.makedirs(dataset_extract_path, exist_ok=True)

# Extract everything (the downloaded archive is expected to be a zip file)
with zipfile.ZipFile(compressed_file, "r") as zip_ref:
    zip_ref.extractall(path=dataset_extract_path)

print(f"✅ Files extracted to: {dataset_extract_path}")

✅ Files extracted to: ./
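If you point the notebook at a different dataset, the archive may not be a zip file. The sketch below shows one way to detect the format before extracting, using `zipfile.is_zipfile` and `tarfile.is_tarfile` from the standard library. The `extract_archive` helper is a hypothetical name and is not used elsewhere in this notebook.

```python
import tarfile
import zipfile
from pathlib import Path

def extract_archive(archive_path, dest_dir):
    """Extract a zip or tar archive into dest_dir and return the detected format."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path, "r") as zf:
            zf.extractall(path=dest)
        return "zip"

    if tarfile.is_tarfile(archive_path):
        # "r:*" lets tarfile pick the compression (none, gz, bz2, xz) automatically.
        with tarfile.open(archive_path, "r:*") as tf:
            tf.extractall(path=dest)
        return "tar"

    raise ValueError(f"Unsupported archive format: {archive_path}")
```

Only extract archives from sources you trust, since archive members can carry unexpected paths.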


## SECTION 3 - Load model and inference


In [5]:
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=10, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate the candidate tile grids (columns, rows) with min_num <= columns * rows <= max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=10):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=dtype,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
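To see how the tiling in `dynamic_preprocess` behaves, the sketch below restates the grid selection in a compact, self-contained form and reports how many tiles a given image size produces. It is a simplified restatement for illustration, not a replacement for the functions above; `pick_grid` and `num_tiles` are hypothetical names, and candidate grids are built with nested loops so that their order is deterministic.

```python
def pick_grid(width, height, image_size=448, max_num=10, min_num=1):
    """Return the (columns, rows) grid whose aspect ratio is closest to the image's."""
    grids = [
        (i, j)
        for i in range(1, max_num + 1)
        for j in range(1, max_num + 1)
        if min_num <= i * j <= max_num
    ]
    grids.sort(key=lambda g: g[0] * g[1])  # fewest tiles first

    aspect_ratio = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in grids:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size * image_size * cols * rows:
            # On a tie, prefer the larger grid only if the image has enough pixels to fill it.
            best = (cols, rows)
    return best

def num_tiles(width, height, image_size=448, max_num=10, use_thumbnail=True):
    """Number of image_size x image_size tiles, plus a thumbnail when there is more than one tile."""
    cols, rows = pick_grid(width, height, image_size, max_num)
    blocks = cols * rows
    return blocks + (1 if use_thumbnail and blocks != 1 else 0)

for size in [(448, 448), (896, 448), (2000, 2000)]:
    print(size, pick_grid(*size), num_tiles(*size))
```

A small square crop stays a single tile, while a large square image is split into more tiles, which increases the number of visual tokens the model has to process.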




In [None]:
images_path = pathlib.Path(dataset_images_subfolder)
predictions = []

# Generation settings are the same for every image, so build them once.
# Note: do_sample=True makes decoding stochastic, so results can vary between runs.
generation_config = dict(max_new_tokens=max_tokens, do_sample=True)

# single-image single-round conversation
question = f'<image>         {prompt}'

for bib in sorted(os.listdir(images_path)):
    img_path = os.path.join(images_path, bib)

    # Load and tile the image once; record 'nan' if the file cannot be read.
    try:
        pixel_values = load_image(img_path, input_size=input_image_size, max_num=max_images).to(dtype).cuda()
    except OSError:
        predictions.append('nan')
        continue

    response = model.chat(tokenizer, pixel_values, question, generation_config)

    # Keep the first standalone number in the response, or 'nan' if there is none.
    numbers = re.findall(r'\b\d+\b', response) if response else []
    predictions.append(numbers[0] if numbers else 'nan')

    print(predictions[-1])


3638
3637
3
3719
3531
3531
nan
724
184
2583
2605
2605
2605
2605
2605
1463
1463
2251
2078
2078
168
822
898
3648
869
599
311
311
113
858
858
3369
3369
2874
2874
2475
723
163
976
2287
2103
400
755
2747
2925
2649
663
663
80
1777
1777
2845
3331
3645
2692
2692
3523
3527
3548
3614
3588
814
3035
1404
1679
941
941
435
435
3225
3708
2244
3628
3241
560
1478
3633
341
1676
3621
847
58
3331
3638
3638
1183
168
847
3655
168
1676
48
690
478
1442
1442
1703
2908
3637
3276
1527
61074
71652
20431
10168
10910
1026
11245
11040
11103
10220
10246
2001
22216
1317
30793
31577
30513
35031
30416
30270
61539
71517
70433
80344
15657
5022
60452
31474
20927
60351
70322
90135
82007
61423
60511
80635
80453
70511
10679
10165
109933
10190
1140
81
10988
11459
10175
10191
10168
11142
10144
10227
10159
1071
10216
0891
111977
10145
19078
10145
10
11050
10284
10265
38160022389898
10991
81991
8
11110
11187
10691
10330
11453
57
11454
10012
4345
4454
3778
4407
3855
3236
3482
775
775
nan
3855
3482
1518
3764
248
2254
3555
1251
4380
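The number extraction used in the loop can be tried in isolation. The sketch below wraps the same `\b\d+\b` pattern in a small helper; `first_number` is a hypothetical name, and the optional `max_len` filter is an assumption of this example (loosely motivated by the `max_digit_length` parameter, which the notebook defines but does not apply). It could be used to skip implausibly long digit strings such as the 14-digit output above.

```python
import re

def first_number(response, max_len=None):
    """Return the first standalone run of digits in response, or 'nan' if there is none."""
    matches = re.findall(r"\b\d+\b", response or "")
    if max_len is not None:
        # Hypothetical filter: ignore digit runs longer than max_len.
        matches = [m for m in matches if len(m) <= max_len]
    return matches[0] if matches else "nan"

print(first_number("The number is 3638."))
print(first_number("I cannot read the bib."))
print(first_number("bib12a"))  # digits glued to letters have no word boundary
print(first_number("38160022389898 or 3816", max_len=5))
```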

In [10]:
print(predictions)
# Write one prediction per line to the file that the evaluation step reads
with open(predictions_output_path, 'w') as f:
    for prediction in predictions:
        f.write(prediction + "\n")


## SECTION 4 - evaluation


Now that we have our predictions, we evaluate the results in two ways:


*   **complete number**: a prediction counts as a true positive only if the predicted number matches the label exactly
*   **by digit**: instead of comparing full numbers, we compare the predicted and labelled numbers digit by digit, position by position
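The two criteria can be illustrated on a toy example held in memory. This is a minimal sketch that assumes the counting convention used in this notebook: a `'nan'` prediction counts as one false negative, and any other mismatch counts as a false positive. The helper names (`prf`, `score_full`, `score_digits`) and the toy data are made up for illustration.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts, returning 0.0 when a ratio is undefined."""
    p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

def score_full(labels, preds):
    """Complete-number scoring: only an exact match is a true positive."""
    tp = fp = fn = 0
    for label, pred in zip(labels, preds):
        if pred == "nan":
            fn += 1
        elif pred == label:
            tp += 1
        else:
            fp += 1
    return prf(tp, fp, fn)

def score_digits(labels, preds):
    """Digit scoring: compare characters position by position."""
    tp = fp = fn = 0
    for label, pred in zip(labels, preds):
        if pred == "nan":
            fn += 1
            continue
        for i in range(max(len(label), len(pred))):
            if i >= len(pred):
                fn += 1  # label digit with no predicted digit
            elif i >= len(label) or label[i] != pred[i]:
                fp += 1  # extra or wrong predicted digit
            else:
                tp += 1
    return prf(tp, fp, fn)

toy_labels = ["3638", "724", "2605"]
toy_preds = ["3638", "742", "nan"]
print("full number:", score_full(toy_labels, toy_preds))
print("by digit:   ", score_digits(toy_labels, toy_preds))
```

In this toy case the second prediction is entirely wrong as a full number, yet it still earns one correct digit, which is why the digit scores tend to be more forgiving.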



In [11]:
def evaluate(labels_path: str, predictions_path: str) -> tuple[float, float, float]:

    labels = []
    with open(labels_path, 'r') as f:
        for line in f:
            labels.append(line.strip())

    predictions = []
    with open(predictions_path, 'r') as f:
        for line in f:
            predictions.append(line.strip())

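    TP = 0
    FP = 0
    FN = 0
    for label, prediction in zip(labels, predictions):
        if prediction == 'nan':
            FN += 1  # no number was extracted for this image
            continue

        if prediction == label:
            TP += 1  # exact match
            continue

        FP += 1  # a number was predicted but it differs from the label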

    P = TP / (TP + FP) if (TP + FP) > 0 else 0.0
    R = TP / (TP + FN) if (TP + FN) > 0 else 0.0
    F1 = 2 * (R * P) / (R + P) if (R + P) > 0 else 0.0
    return P, R, F1


In [12]:


def evaluate_digit(labels_path: str, predictions_path: str) -> tuple[float, float, float]:

    labels = []
    with open(labels_path, 'r') as f:
        for line in f:
            labels.append(line.strip())

    predictions = []
    with open(predictions_path, 'r') as f:
        for line in f:
            predictions.append(line.strip())

    TP = 0
    FP = 0
    FN = 0

    for label, prediction in zip(labels, predictions):
        if prediction == 'nan':
            FN +=1
            continue

        max_len = max(len(label), len(prediction))

        for i in range(max_len):
            true_digit = label[i] if i < len(label) else None
            pred_digit = prediction[i] if i < len(prediction) else None

            if true_digit is not None and pred_digit is not None:  # both digits exist, so compare them
                if true_digit == pred_digit:
                    TP += 1  # correct digit -> TP
                else:
                    FP += 1  # wrong digit -> FP
            elif true_digit is not None and pred_digit is None:  # label digit with no predicted digit -> FN
                FN += 1
            elif pred_digit is not None and true_digit is None:
                FP += 1  # predicted digit beyond the end of the label -> FP

    P = TP / (TP + FP) if (TP + FP) > 0 else 0.0
    R = TP / (TP + FN) if (TP + FN) > 0 else 0.0
    F1 = 2 * (R * P) / (R + P) if (R + P) > 0 else 0.0
    return P, R, F1


In [15]:
P_digit, R_digit, F1_digit = evaluate_digit(labels_path, predictions_output_path)
P_full, R_full, F1_full = evaluate(labels_path, predictions_output_path)

print("===== 📊 SAIL-VL RESULTS =====")
print("\nFull number evaluation:")
print(f"Precision:  {P_full*100:.2f}%")
print(f"Recall:     {R_full*100:.2f}%")
print(f"F1-score:   {F1_full*100:.2f}%")

print("\nDigit evaluation:")
print(f"Precision:  {P_digit*100:.2f}%")
print(f"Recall:     {R_digit*100:.2f}%")
print(f"F1-score:   {F1_digit*100:.2f}%")

print("\n=================================")


===== 📊 SAIL-VL RESULTS =====

Full number evaluation:
Precision:  91.67%
Recall:     99.25%
F1-score:   95.31%

Digit evaluation:
Precision:  94.59%
Recall:     97.78%
F1-score:   96.16%

