In [1]:
%load_ext autoreload
%autoreload 2

## Note: the code is fully reproducible:
- gathering a benchmark is deterministic, no randomness involved
- all models run inference with greedy decoding (temperature=0, `do_sample=False`)
- the libraries installed in `.venv` are listed in `requirements.txt`

---

# Imports

In [2]:
from collections import defaultdict
import json
import os

from datasets import load_dataset
import dotenv
import pandas as pd
import numpy as np
from tabulate import tabulate
from tqdm import tqdm
dotenv.load_dotenv()

from consts import LLAVA_MODEL_ID, QWEN_2_5_VL_MODEL_ID
from utils import (
    select_different_products_by_category, download_images,
    predict, load_results, parse_price, compute_metrics,
    calculate_blue_and_rouge_scores, ModelsLoader
)

# Load `llava-hf/llava-1.5-7b-hf` and `Qwen/Qwen2.5-VL-3B-Instruct` into RAM
models_loader = ModelsLoader()
models_loader.load_model_and_processor(LLAVA_MODEL_ID)
models_loader.load_model_and_processor(QWEN_2_5_VL_MODEL_ID);

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

___

# Task 0. Load a dataset & construct a benchmark

In [3]:
ds = load_dataset("milistu/AMAZON-Products-2023")
dataframe: pd.DataFrame = ds["train"].to_pandas()
dataframe.filename.value_counts().iloc[:5]

filename
meta_Clothing_Shoes_and_Jewelry    41777
meta_Home_and_Kitchen              17326
meta_Electronics                    7681
meta_Automotive                     7161
meta_Sports_and_Outdoors            6343
Name: count, dtype: int64

It can be seen that the filenames (which de facto represent the parent categories) are not balanced across the 30 different categories.

To create a representative benchmark, we need to take into account each category.

Let's take 40 samples of each category if there are at least 40, otherwise take all of them.

How do we select 40 samples from one category? Random sampling can be misleading: we could end up with 40 cell phones from the category `meta_Electronics`, which would not be representative.

It is better to sample different products within one filename (for example, a cell phone, headphones, a TV, etc. from the category `meta_Electronics`).

To achieve this, we can use the column `embeddings` (embeddings generated for each product with the text-embedding-3-small model).
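One deterministic way to pick dissimilar products from embeddings is greedy farthest-point sampling. The sketch below is only an assumption about how such a selection could work; it is not the actual implementation of `select_different_products_by_category` (which lives in `utils` and is not shown here). The function name `farthest_point_sample` and the toy 2-D vectors are hypothetical.

```python
import math


def farthest_point_sample(vectors: list[list[float]], k: int) -> list[int]:
    """Greedily pick up to k indices whose vectors are far apart (hypothetical helper).

    Deterministic: starts from the vector closest to the centroid, then repeatedly
    adds the vector whose distance to the already-selected set is largest.
    Ties are broken by the lowest index.
    """
    n = len(vectors)
    if n <= k:
        return list(range(n))  # fewer than k items: take all of them

    dim = len(vectors[0])
    centroid = [sum(v[d] for v in vectors) / n for d in range(dim)]
    first = min(range(n), key=lambda i: math.dist(vectors[i], centroid))

    selected = [first]
    # min_dist[i] = distance from vector i to its nearest selected vector
    min_dist = [math.dist(v, vectors[first]) for v in vectors]
    while len(selected) < k:
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(vectors[i], vectors[nxt]))
    return selected


# Toy example: three tight clusters of two points each.
toy = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0], [10.1, 0.0]]
picked = farthest_point_sample(toy, 3)
print(picked)  # one index from each cluster
```

Because there is no random seed involved, repeated runs on the same data return the same indices, which matches the reproducibility requirement stated at the top of the notebook.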




In [4]:
benchmark = pd.concat(
    # THIS CODE IS DETERMINISTIC !!!
    [select_different_products_by_category(dataframe, category, 40) for category in dataframe['filename'].unique()]
)
benchmark = benchmark.reset_index(drop=True)  # reset_index returns a new DataFrame, so assign it back

download_images(benchmark, "image", save_dir="data/images")

benchmark.shape

(1085, 16)

Downloading images: 0it [00:00, ?it/s]


# Benchmark size: 1085

-----

# Task 1. Regression task: predict product price

## Metrics to use: MAPE (good) + MAE and MSE (not good)
### Why MAPE?
Product prices can vary a lot (for example, $10, $100, $1000, etc.). Because of this, scale-dependent metrics like MAE and MSE are dominated by expensive products and are not well suited as the main metric, though they are still reported.

MAPE is the better choice because it measures the error relative to the actual price, so it is not affected by the magnitude of the price.
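To make the scale argument concrete, here is a minimal sketch of the three metrics using their standard definitions. It assumes that `compute_metrics` from `utils` follows the same definitions (its code is not shown here); the name `regression_metrics` and the toy prices are hypothetical.

```python
def regression_metrics(y_true: list[float], y_pred: list[float]) -> dict[str, float]:
    """Standard MAPE (in percent), MAE and MSE; assumes every y_true is non-zero."""
    n = len(y_true)
    abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return {
        "MAPE": 100.0 * sum(e / abs(t) for e, t in zip(abs_errors, y_true)) / n,
        "MAE": sum(abs_errors) / n,
        "MSE": sum(e * e for e in abs_errors) / n,
    }


# Both predictions are off by 50%, but on very different price scales.
cheap = regression_metrics([10.0], [15.0])
expensive = regression_metrics([1000.0], [1500.0])
print(cheap)      # MAPE 50.0, MAE 5.0, MSE 25.0
print(expensive)  # MAPE 50.0, MAE 500.0, MSE 250000.0
```

The relative error is identical in both cases, while MAE grows by a factor of 100 and MSE by a factor of 10,000. A single expensive product can therefore dominate MAE and MSE over the whole benchmark.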


### Input: title, store, description, rating, main_category, details and image
### Output: price

In [5]:
for model_id in [LLAVA_MODEL_ID, QWEN_2_5_VL_MODEL_ID]:

    model_id_name = model_id.split("/")[-1]
    results = predict(
        models_loader=models_loader,
        model_id=model_id,
        benchmark=benchmark,
        inference_batch_size=1,
        task="price"
    )

INFO:utils:Loaded 1085 cached results for llava-1.5-7b-hf
INFO:utils:Predicting price for 0 samples
INFO:utils:Saved 1085 predictions for llava-1.5-7b-hf to data/results/llava-1.5-7b-hf_price_predictions.json
INFO:utils:Loaded 1085 cached results for Qwen2.5-VL-3B-Instruct
INFO:utils:Predicting price for 0 samples
INFO:utils:Saved 1085 predictions for Qwen2.5-VL-3B-Instruct to data/results/Qwen2.5-VL-3B-Instruct_price_predictions.json


### Parse results

Some predictions cannot be parsed into a price. Since there are only a few such examples, let's drop them from the benchmark so that both models are compared on exactly the same samples.
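For illustration, a price parser could look like the sketch below. This is an assumption, not the actual `parse_price` from `utils` (which is not shown here); the name `parse_price_sketch` and the regular expression are hypothetical.

```python
import re
from typing import Optional

# Matches numbers such as "19.99", "1,299.00" or "25" (hypothetical pattern).
_NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def parse_price_sketch(text: str) -> Optional[float]:
    """Return the first number found in a model reply, or None if there is none."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(parse_price_sketch("The price is $1,299.00."))  # 1299.0
print(parse_price_sketch("$19.99"))                    # 19.99
print(parse_price_sketch("I cannot determine it."))    # None
```

Returning `None` for unparseable replies lets the caller decide whether to drop the sample, as is done in the next cell.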

In [6]:
# single quotes inside the f-string keep this valid on Python versions before 3.12
llava_results = load_results(f"data/results/{LLAVA_MODEL_ID.split('/')[-1]}_price_predictions.json")
qwen_results = load_results(f"data/results/{QWEN_2_5_VL_MODEL_ID.split('/')[-1]}_price_predictions.json")

llava_predictions = []
qwen_predictions = []
ground_truth = []

llava_predictions_per_category = defaultdict(list)
qwen_predictions_per_category = defaultdict(list)
ground_truth_per_category = defaultdict(list)

for parent_asin in benchmark["parent_asin"]:

    ground_truth_price: float = benchmark.loc[benchmark["parent_asin"] == parent_asin, "price"].values[0].item()
    if ground_truth_price <= 0:  # this should not happen, but it does occur in the data and would result in an infinite MAPE
        continue

    category: str = benchmark.loc[benchmark["parent_asin"] == parent_asin, "filename"].values[0]

    llava_pred: str = llava_results[parent_asin]
    qwen_pred: str = qwen_results[parent_asin]

    llava_pred_parsed: float | None = parse_price(llava_pred)
    qwen_pred_parsed: float | None = parse_price(qwen_pred)

    if not llava_pred_parsed or not qwen_pred_parsed:
        continue # skip some (invalid) samples from the benchmark

    llava_predictions.append(llava_pred_parsed)
    qwen_predictions.append(qwen_pred_parsed)
    ground_truth.append(ground_truth_price)

    llava_predictions_per_category[category].append(llava_pred_parsed)
    qwen_predictions_per_category[category].append(qwen_pred_parsed)
    ground_truth_per_category[category].append(ground_truth_price)

print("FINAL BENCHMARK SIZE FOR THE PRICE PREDICTION TASK (some samples are dropped due to invalid predictions or zero price):", len(ground_truth))

FINAL BENCHMARK SIZE FOR THE PRICE PREDICTION TASK (some samples are dropped due to invalid predictions or zero price): 1043


In [7]:
from tabulate import tabulate

llava_metrics = compute_metrics(ground_truth, llava_predictions)
qwen_metrics = compute_metrics(ground_truth, qwen_predictions)

const_predictions_mean_metrics = compute_metrics(ground_truth, [np.mean(ground_truth)] * len(ground_truth))
const_predictions_median_metrics = compute_metrics(ground_truth, [np.median(ground_truth)] * len(ground_truth))

# find the best constant prediction that minimizes MAPE
increment: float = 0.1
lowest_mape = 1e9
best_c = 0
c = 0.1
for _ in range(10_000):
    c += increment
    metrics = compute_metrics(ground_truth, [c] * len(ground_truth))
    if metrics["MAPE"] < lowest_mape:
        lowest_mape = metrics["MAPE"]
        best_c = c


table = [
    ["LLaVA"] + [llava_metrics["MAPE"], llava_metrics["MAE"], llava_metrics["MSE"]],
    ["Qwen"] + [qwen_metrics["MAPE"], qwen_metrics["MAE"], qwen_metrics["MSE"]],
    ["Constant (mean)"] + [const_predictions_mean_metrics["MAPE"], const_predictions_mean_metrics["MAE"], const_predictions_mean_metrics["MSE"]],
    ["Constant (median)"] + [const_predictions_median_metrics["MAPE"], const_predictions_median_metrics["MAE"], const_predictions_median_metrics["MSE"]],   
    ["Constant (lowest MAPE)"] + [compute_metrics(ground_truth, [best_c] * len(ground_truth))["MAPE"], compute_metrics(ground_truth, [best_c] * len(ground_truth))["MAE"], compute_metrics(ground_truth, [best_c] * len(ground_truth))["MSE"]],   
]
headers = ["Model", "MAPE (%)", "MAE", "MSE"]
print(tabulate(table, headers=headers, tablefmt="github"))

| Model                  |   MAPE (%) |   MAE |      MSE |
|------------------------|------------|-------|----------|
| LLaVA                  |      81.6  | 27.04 |  5669.12 |
| Qwen                   |      59.33 | 23.6  |  5236.79 |
| Constant (mean)        |     188.12 | 39.62 |  9254.71 |
| Constant (median)      |      70.89 | 31.59 |  9844.34 |
| Constant (lowest MAPE) |      56.15 | 34.08 | 10309.2  |


### Conclusions
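- `Qwen/Qwen2.5-VL-3B-Instruct` outperforms `llava-hf/llava-1.5-7b-hf` on all three metrics
- 59.33% MAPE is still extremely high, so further fine-tuning is necessary
- both VLMs beat the constant baselines on MAE and MSE, but not consistently on MAPE: the best constant prediction reaches 56.15% MAPE, slightly better than Qwen (59.33%), and the median constant (70.89%) beats LLaVA (81.6%)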
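The grid search for the MAPE-optimal constant can be cross-checked analytically. Minimizing MAPE over a constant `c` means minimizing the sum of `|y_i - c| / y_i`, which is a weighted absolute-deviation problem with weights `1 / y_i`, and its minimizer is a weighted median of the ground-truth prices. The sketch below shows this on made-up toy prices; the function names are hypothetical and the numbers are not taken from the benchmark.

```python
def mape_optimal_constant(prices: list[float]) -> float:
    """Weighted median of prices with weights 1/price (prices must be positive)."""
    pairs = sorted((p, 1.0 / p) for p in prices)
    half = sum(w for _, w in pairs) / 2.0
    cumulative = 0.0
    for price, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return price
    return pairs[-1][0]  # not reached for non-empty input; kept for safety


def mape(prices: list[float], c: float) -> float:
    """MAPE (in percent) of the constant prediction c."""
    return 100.0 * sum(abs(p - c) / p for p in prices) / len(prices)


toy_prices = [5.0, 10.0, 20.0, 100.0, 1000.0]  # illustrative values only
best = mape_optimal_constant(toy_prices)
print(best, round(mape(toy_prices, best), 2))
```

Because the weights favor cheap products, the MAPE-optimal constant sits well below both the mean and the median. This is also why a constant can score a low MAPE while having a comparatively high MAE and MSE.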


## Errors analysis

In [8]:
rows = []
# use a separate loop variable so that the global `ground_truth` list is not overwritten
for category, category_ground_truth in ground_truth_per_category.items():
    if len(category_ground_truth) < 40:
        continue
    current_llava_metrics = compute_metrics(category_ground_truth, llava_predictions_per_category[category])
    current_qwen_metrics = compute_metrics(category_ground_truth, qwen_predictions_per_category[category])
    gt_mean = np.mean(category_ground_truth)
    gt_std = np.std(category_ground_truth)

    rows.append([
        category,
        current_llava_metrics["MAPE"], current_llava_metrics["MAE"], current_llava_metrics["MSE"],
        current_qwen_metrics["MAPE"], current_qwen_metrics["MAE"], current_qwen_metrics["MSE"],
        round(gt_mean, 2), round(gt_std, 2)
    ])

# Sort by LLaVA MAPE (index 1 in the row)
rows.sort(key=lambda row: row[1])

headers = [
    "Category",
    "LLaVA MAPE (%)", "LLaVA MAE", "LLaVA MSE",
    "Qwen MAPE (%)", "Qwen MAE", "Qwen MSE",
    "GT Mean", "GT std"
]
print(tabulate(rows, headers=headers, tablefmt="github"))
print()

# compute correlation between MAPE and GT mean / GT std (seems logical to me)
qwen_mape = [row[4] for row in rows]
llava_mape = [row[1] for row in rows]
gt_mean = [row[7] for row in rows]
gt_std = [row[8] for row in rows]

correlation_matrix_std_qwen = np.corrcoef(qwen_mape, gt_std)
correlation_matrix_mean_qwen = np.corrcoef(qwen_mape, gt_mean)
correlation_matrix_std_llava = np.corrcoef(llava_mape, gt_std)
correlation_matrix_mean_llava = np.corrcoef(llava_mape, gt_mean)

print(f"Correlation between Qwen MAPE and GT std: {round(correlation_matrix_std_qwen[0][1], 3)}")
print(f"Correlation between Qwen MAPE and GT mean: {round(correlation_matrix_mean_qwen[0][1], 3)}")
print()
print(f"Correlation between LLaVA MAPE and GT std: {round(correlation_matrix_std_llava[0][1], 3)}")
print(f"Correlation between LLaVA MAPE and GT mean: {round(correlation_matrix_mean_llava[0][1], 3)}")

| Category                        |   LLaVA MAPE (%) |   LLaVA MAE |   LLaVA MSE |   Qwen MAPE (%) |   Qwen MAE |   Qwen MSE |   GT Mean |   GT std |
|---------------------------------|------------------|-------------|-------------|-----------------|------------|------------|-----------|----------|
| meta_Digital_Music              |            24.63 |        6.52 |      130.08 |           26.25 |       6.56 |     141.26 |     22.05 |    12.89 |
| meta_Grocery_and_Gourmet_Food   |            57.09 |       11.63 |      320.56 |           49.21 |      11.06 |     318.69 |     26.97 |    42.5  |
| meta_Video_Games                |            60.01 |       17.68 |      689.51 |           49.59 |      16.11 |     622.17 |     38.81 |    37.77 |
| meta_Health_and_Household       |            62.29 |       17.82 |      920.28 |           43.4  |      17.44 |    1288.65 |     35.1  |    52.31 |
| meta_Handmade_Products          |            64.24 |       27.46 |     3176.68 |           61.52 |

The table above shows that the scores vary significantly across categories, which implies that some categories are harder to deal with (like `Toys and Games` or `Pet Supplies`) and some are fairly easy (`meta_Digital_Music`). Further analysis is required to determine the underlying cause.

Also, a closer look at the data itself reveals that many samples are bizarre: the listed price does not look correct at all.

This noise in the data contributes to the high overall error.

---

# Task 2


## Task 2.1
- Input: description, store, price, rating, main_category, details and image
- Output: title

In [9]:
for model_id in [LLAVA_MODEL_ID, QWEN_2_5_VL_MODEL_ID]:

    model_id_name = model_id.split("/")[-1]
    results = predict(
        models_loader=models_loader,
        model_id=model_id,
        benchmark=benchmark,
        inference_batch_size=1,
        task="title",
        use_images=True,
        do_sample=False,
        max_new_tokens=10
    )

calculate_blue_and_rouge_scores(benchmark=benchmark, task="title")

INFO:utils:Loaded 1085 cached results for llava-1.5-7b-hf
INFO:utils:Predicting title for 0 samples
INFO:utils:Saved 1085 predictions for llava-1.5-7b-hf to data/results/llava-1.5-7b-hf_title_predictions.json
INFO:utils:Loaded 1085 cached results for Qwen2.5-VL-3B-Instruct
INFO:utils:Predicting title for 0 samples
INFO:utils:Saved 1085 predictions for Qwen2.5-VL-3B-Instruct to data/results/Qwen2.5-VL-3B-Instruct_title_predictions.json
INFO:absl:Using default tokenizer.
INFO:absl:Using default tokenizer.


| Model                       |   BLEU |   ROUGE-L |
|-----------------------------|--------|-----------|
| llava-hf/llava-1.5-7b-hf    |  0.089 |     0.385 |
| Qwen/Qwen2.5-VL-3B-Instruct |  0.102 |     0.419 |


## `Qwen2.5-VL-3B-Instruct` is better at predicting titles than `llava-1.5-7b-hf`

# Remove the image modality and see the result

In [10]:
model_id = QWEN_2_5_VL_MODEL_ID
model_id_name = model_id.split("/")[-1]
results = predict(
    models_loader=models_loader,
    model_id=model_id,
    benchmark=benchmark,
    inference_batch_size=1,
    task="title",
    use_images=False,
    do_sample=False,
    max_new_tokens=10
)

calculate_blue_and_rouge_scores(benchmark=benchmark, task="title", use_images=False, models_to_eval=[model_id])

INFO:utils:Loaded 1085 cached results for Qwen2.5-VL-3B-Instruct
INFO:utils:Predicting title for 0 samples
INFO:utils:Saved 1085 predictions for Qwen2.5-VL-3B-Instruct to data/results/Qwen2.5-VL-3B-Instruct_title_no_images_predictions.json
INFO:absl:Using default tokenizer.


| Model                       |   BLEU |   ROUGE-L |
|-----------------------------|--------|-----------|
| Qwen/Qwen2.5-VL-3B-Instruct |  0.073 |     0.355 |


# Conclusion: BLEU and ROUGE-L both decreased when the image modality was disabled!

## `-> image matters`

___

## Task 2.2
- Input: title, main_category, n_words** and image
- Output: description

[!] **Since the ground-truth description can be of any length, generating a description of arbitrary length would not be a fair test (BLEU and ROUGE would fail on the length mismatch alone). For this reason, the prompt also specifies the desired length of the description (in terms of the number of words).

In [11]:
for model_id in [LLAVA_MODEL_ID]: # QWEN_2_5_VL_MODEL_ID

    model_id_name = model_id.split("/")[-1]
    results = predict(
        models_loader=models_loader,
        model_id=model_id,
        benchmark=benchmark,
        inference_batch_size=1,
        task="description",
        use_images=True,
        do_sample=False,
        max_new_tokens=200  # some descriptions can be quite long
    )

calculate_blue_and_rouge_scores(benchmark=benchmark, task="description")

INFO:utils:Loaded 1085 cached results for llava-1.5-7b-hf
INFO:utils:Predicting description for 0 samples
INFO:utils:Saved 1085 predictions for llava-1.5-7b-hf to data/results/llava-1.5-7b-hf_description_predictions.json
INFO:absl:Using default tokenizer.
INFO:absl:Using default tokenizer.


| Model                       |   BLEU |   ROUGE-L |
|-----------------------------|--------|-----------|
| llava-hf/llava-1.5-7b-hf    |   0.03 |     0.222 |
| Qwen/Qwen2.5-VL-3B-Instruct |   0.04 |     0.236 |


## `Qwen2.5-VL-3B-Instruct` is slightly better at predicting descriptions than `llava-1.5-7b-hf`
*However, these metrics say very little here, since many different descriptions can be perfectly fine. BLEU and ROUGE cannot evaluate this correctly.
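The limitation of overlap metrics can be illustrated with a simplified ROUGE-L. The sketch below computes ROUGE-L F1 from the longest common subsequence of lowercased whitespace tokens. It is an assumption-level approximation and not the scorer used by `calculate_blue_and_rouge_scores` (which may tokenize and normalize differently); the example sentences are made up.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if token_a == token_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (simplified tokenization)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "soft cotton t-shirt for men with a round neck"
paraphrase = "men's crew-neck tee made from comfortable cotton"  # correct, different wording
copy_like = "soft cotton t-shirt for men with a v neck"          # wrong neck type, same wording

print(round(rouge_l_f1(reference, paraphrase), 3))  # 0.125
print(round(rouge_l_f1(reference, copy_like), 3))   # 0.889
```

The accurate paraphrase scores far lower than the near-copy that gets a product attribute wrong, which is why a semantic evaluation is a useful complement to BLEU and ROUGE.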

# Conclusion:

`Qwen/Qwen2.5-VL-3B-Instruct` outperformed `llava-hf/llava-1.5-7b-hf` again on both BLEU and ROUGE-L for both the title prediction and the description prediction tasks

----

# LLM-as-a-Judge (via `Qwen/Qwen3-0.6B` -- poor choice, but fast and cheap)

In [12]:
from transformers import AutoTokenizer, AutoModelForCausalLM
from consts import LLM_AS_A_JUDGE_PROMPT_TEMPLATE, LLM_AS_A_JUDGE_MODEL_ID

model = AutoModelForCausalLM.from_pretrained(LLM_AS_A_JUDGE_MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(LLM_AS_A_JUDGE_MODEL_ID)
tokenizer.padding_side = "left"

In [13]:
TASKS = ["title", "description"]
MODEL_IDS = [LLAVA_MODEL_ID, QWEN_2_5_VL_MODEL_ID]


def parse_llm_evaluation(evaluation_scores: list[str]) -> list[float]:
    scores: list[float] = []
    for score in evaluation_scores:
        score_splitted = score.split()
        value_parsed: bool = False
        for s in score_splitted:
            try:
                score_value = float(s)
                scores.append(score_value)
                value_parsed = True
                break
            except ValueError:
                continue
        if not value_parsed:
            scores.append(0)
            print(f"WARNING: Failed to parse score: {score}")
    return scores


for task in TASKS:
    for model_id in MODEL_IDS:
        model_id_name = model_id.split("/")[-1]
        llm_eval_scores_fp: str = f"data/results/llm_eval_{model_id_name}_{task}.jsonl"
        if os.path.exists(llm_eval_scores_fp):
            print(f"Skipping evaluation of {model_id_name} on {task} because it already exists")
            continue

        llm_evaluation_scores: list[dict] = []  # entries look like {"parent_asin": ..., "score": ...}
        results_fp: str = f"data/results/{model_id_name}_{task}_predictions.json"
        results: dict[str, str] = load_results(results_fp) # results = dict[parent_asin] = predicted_<task>
        results_list: list[str] = list(results.values())
        results_parent_asins: list[str] = list(results.keys())

        inference_batch_size: int = 64
        for i in tqdm(range(0, len(results), inference_batch_size), desc=f"Evaluating {model_id_name} on {task}"):
            batch = results_list[i:i+inference_batch_size]
            parent_asins = results_parent_asins[i:i+inference_batch_size]
            chats = [
                [ 
                    {
                        "role": "user",
                        "content": LLM_AS_A_JUDGE_PROMPT_TEMPLATE.format(
                            task=task,
                            original=benchmark.loc[benchmark["parent_asin"] == parent_asin, task].values[0],
                            predicted=results[parent_asin]
                        )
                    } 
                ] for parent_asin in parent_asins
            ]
            inputs = tokenizer.apply_chat_template(
                chats,
                add_generation_prompt=True,
                tokenize=True,
                return_dict=True,
                return_tensors="pt",
                padding=True,
                enable_thinking=False
            ).to(model.device)

            outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
            decoded_results = tokenizer.batch_decode([
                output[inputs["input_ids"].shape[-1]:]
                for output in outputs
            ], skip_special_tokens=True)
            llm_evaluation_scores.extend([
                {"parent_asin": parent_asin, "score": score}
                for parent_asin, score in zip(parent_asins, parse_llm_evaluation(decoded_results))
            ])

        # save LLM scores to data/results/llm_eval_{model_id_name}_{task}.jsonl (one JSON object per line: parent_asin and score)
        with open(llm_eval_scores_fp, "w") as f:
            for entry in llm_evaluation_scores:
                f.write(json.dumps(entry) + "\n")

Skipping evaluation of llava-1.5-7b-hf on title because it already exists
Skipping evaluation of Qwen2.5-VL-3B-Instruct on title because it already exists
Skipping evaluation of llava-1.5-7b-hf on description because it already exists
Skipping evaluation of Qwen2.5-VL-3B-Instruct on description because it already exists


In [14]:
for task in TASKS:
    table = []
    headers = ["Model", f"Average LLM Score ({task})"]
    for model_id in MODEL_IDS:
        model_id_name = model_id.split("/")[-1]
        llm_eval_scores_fp: str = f"data/results/llm_eval_{model_id_name}_{task}.jsonl"
        with open(llm_eval_scores_fp, "r") as f:
            llm_evaluation_scores = [json.loads(line) for line in f]
            average_score = np.mean([e["score"] for e in llm_evaluation_scores])
            table.append([model_id_name, average_score])

    print(tabulate(table, headers=headers, floatfmt=".3f", tablefmt="github"))
    print()

| Model                  |   Average LLM Score (title) |
|------------------------|-----------------------------|
| llava-1.5-7b-hf        |                       7.597 |
| Qwen2.5-VL-3B-Instruct |                       8.106 |

| Model                  |   Average LLM Score (description) |
|------------------------|-----------------------------------|
| llava-1.5-7b-hf        |                             7.021 |
| Qwen2.5-VL-3B-Instruct |                             7.576 |



### Result: `Qwen2.5-VL-3B-Instruct` wins again (according to LLM-as-a-Judge)

---