Evaluate-Models

Comprehensive evaluation framework for the **DriveFusionQA** model, a vision-language model designed to answer driving-related questions with visual context.


Overview

This repository provides tools and utilities to evaluate the DriveFusionQA model's performance on various driving-related question-answering datasets. The framework uses state-of-the-art evaluation metrics including Lingo-Judge to assess answer quality and truthfulness.

Features

  • Qwen2.5-VL Model Evaluation: Evaluate Qwen2.5-VL-3B-Instruct with optional LoRA fine-tuning adapters
  • Extensible Evaluator Framework: Base Evaluator class supporting multiple vision-language models
  • Multiple Evaluation Metrics: Built-in support for BLEU, METEOR, CIDEr, BERTScore, MQA, Prometheus, and Lingo-Judge
  • LingoQA Benchmark Support: Optimized evaluation for the LingoQA driving-related QA dataset
  • Flexible Model Loading: Singleton pattern for efficient model management with CUDA support (see the sketch after this list)
  • Batch Processing: Process multiple datasets and generate comprehensive evaluation reports
  • Configurable Parameters: Easy customization of model checkpoints, dataset paths, and output locations
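
The singleton behavior mentioned above can be pictured with a short sketch. The class name and attribute are illustrative assumptions, not the repository's actual code (that lives in drivefusion_evaluate/models/qwen2_5_vl_lora.py):

class ModelSingleton:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Reuse the existing instance so the multi-GB model is loaded
        # onto the GPU only once per process (illustrative sketch).
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance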

Installation

  1. Clone the repository:
git clone https://github.com/DriveFusion/Evaluate-Models.git
cd Evaluate-Models
  2. Install dependencies:
pip install -r requirements.txt

Usage

Basic Evaluation

from drivefusion_evaluate import evaluate_qwen2_5_vl, show_score

# Run evaluation with LoRA adapter
evaluate_qwen2_5_vl(
    datasets_path="/path/to/datasets",
    results_path="./results",
    images_path=["/path/to/images"],
    adaptor_path="/path/to/model/checkpoint"
)

# Or evaluate base model (without adapter)
evaluate_qwen2_5_vl(
    datasets_path="/path/to/datasets",
    results_path="./results",
    images_path=["/path/to/images"],
    adaptor_path=None
)

# Display evaluation results
show_score("./results/lingoqa_eval_lingo_judge_score.csv")

Advanced Configuration

from drivefusion_evaluate.models import Qwen2_5_VLLora

# Custom model initialization with specific parameters
model = Qwen2_5_VLLora(
    adapter_path="/path/to/checkpoint",
    model_name="Qwen/Qwen2.5-VL-3B-Instruct",  # Default
    device_map="cuda"
)

# Generate predictions on custom data
message = [{"role": "user", "content": [...]}]
prediction = model.generate_text_from_message(message, max_new_tokens=1024)
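
The content list above is elided; a filled-in message, assuming the standard Qwen2.5-VL chat schema (the image path and question are placeholders), could look like:

message = [
    {
        "role": "user",
        "content": [
            # One or more images, then the question text.
            {"type": "image", "image": "/path/to/images/frame_000123.jpg"},
            {"type": "text", "text": "Is it safe to change lanes now?"},
        ],
    }
]
prediction = model.generate_text_from_message(message, max_new_tokens=1024)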

Configuration

Main parameters in evaluate.py:

  • datasets_path: Path to evaluation datasets (JSON format with questions and answers; a hypothetical record is sketched after this list)
  • results_path: Directory to save evaluation results and scores
  • images_path: List of image directories corresponding to dataset images
  • adaptor_path: Path to LoRA adapter checkpoint or None for base model evaluation
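
A hypothetical dataset record, purely to illustrate the kind of fields involved (an assumption, not the repository's actual schema):

example_record = {
    "question": "Is it safe to change lanes now?",  # model input
    "answer": "No, there is a vehicle in the left blind spot.",  # reference
    "images": ["frame_000123.jpg"],  # resolved against images_path
}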

Model parameters in drivefusion_evaluate/models/qwen2_5_vl_lora.py:

  • model_name: HuggingFace model identifier (default: Qwen/Qwen2.5-VL-3B-Instruct)
  • device_map: Device for model loading (default: "cuda")
  • max_new_tokens: Maximum tokens to generate per prediction (default: 1024)

Project Structure

drivefusion_evaluate/
├── evaluate.py                      # Main evaluation entry point
├── constants.py                     # Project paths and constants
├── evaluators/
│   ├── __init__.py
│   ├── evaluator.py                # Abstract base evaluator class
│   └── qwen2_5_vl_evaluator.py     # Qwen2.5-VL implementation
├── models/
│   ├── __init__.py
│   └── qwen2_5_vl_lora.py          # Qwen2.5-VL with LoRA support (singleton)
└── evaluation_metrics/
    ├── __init__.py
    ├── lingo_judge.py              # Lingo-Judge metric
    ├── bleu.py                     # BLEU metric
    ├── meteor.py                   # METEOR metric
    ├── cider.py                    # CIDEr metric
    ├── bertscore.py                # BERTScore metric
    ├── mqa.py                      # MQA metric
    └── prometheus.py               # Prometheus metric
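
The evaluator abstraction in evaluators/evaluator.py could be pictured as follows; the method names are assumptions rather than the file's verbatim API:

from abc import ABC, abstractmethod

class Evaluator(ABC):
    """Base class that each vision-language model evaluator subclasses."""

    @abstractmethod
    def predict(self, question: str, image_paths: list[str]) -> str:
        """Return the model's answer for one question plus images."""

    @abstractmethod
    def evaluate(self, datasets_path: str, results_path: str) -> None:
        """Run the model over a dataset and write per-sample results."""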

Requirements

  • Python 3.10+
  • PyTorch with CUDA 11.8 support
  • Transformers library
  • Qwen VL utilities and processor
  • PEFT (Parameter-Efficient Fine-Tuning) for LoRA support
  • Additional evaluation dependencies (see requirements.txt)

Key Dependencies

  • torch & torchvision: Deep learning framework
  • transformers: Model loading and processing
  • qwen-vl-utils: Qwen-specific vision utilities
  • peft: LoRA adapter loading
  • scikit-learn: ML utilities for metrics
  • pandas: Result aggregation and reporting
  • tqdm: Progress bars

Evaluation Metrics

Available Metrics

The framework includes multiple evaluation metrics for comprehensive model assessment; a minimal sketch computing two of them follows the list:

  • Lingo-Judge: Specialized text classifier that scores the truthfulness and accuracy of answers on the LingoQA benchmark, with normalized scores in [0, 1]
  • BLEU: Bilingual Evaluation Understudy - measures n-gram overlap with reference answers
  • METEOR: Metric for Evaluation of Translation with Explicit ORdering - includes semantic similarity
  • CIDEr: Consensus-based Image Description Evaluation - optimized for visual QA tasks
  • BERTScore: Leverages BERT embeddings for semantic similarity evaluation
  • MQA: Multimodal Question Answering metric
  • Prometheus: Advanced LLM-based evaluation using language models for quality assessment
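
As a concrete illustration, two of the reference-based metrics above can be computed with the standalone nltk and bert-score packages; this is a minimal sketch, and the repository's own implementations in evaluation_metrics/ may differ:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bertscore

reference = "No, there is a vehicle in the left blind spot."
prediction = "No, a car is in your blind spot on the left."

# BLEU: n-gram overlap between prediction and reference tokens.
bleu = sentence_bleu(
    [reference.split()],
    prediction.split(),
    smoothing_function=SmoothingFunction().method1,
)

# BERTScore: semantic similarity from contextual embeddings.
precision, recall, f1 = bertscore([prediction], [reference], lang="en")
print(f"BLEU: {bleu:.3f}  BERTScore F1: {f1.item():.3f}")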

Result Aggregation

The show_score() function aggregates Lingo-Judge scores across all predictions and displays the average model performance as a percentage:

show_score("./results/lingoqa_eval_lingo_judge_score.csv")
# Output: model score: 85.42
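
A minimal sketch of that aggregation, assuming the CSV holds one per-sample Lingo-Judge score in a column named score (the actual column name may differ):

import pandas as pd

def show_score_sketch(csv_path: str) -> None:
    # Average the per-sample scores and report them as a percentage.
    scores = pd.read_csv(csv_path)["score"]
    print(f"model score: {scores.mean() * 100:.2f}")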

Support

For issues, questions, or contributions, please contact the DriveFusion team.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
