Evaluate-Models

Comprehensive evaluation framework for the **DriveFusionQA** model, a vision-language model designed to answer driving-related questions with visual context.


Overview

This repository provides tools and utilities to evaluate the DriveFusionQA model's performance on various driving-related question-answering datasets. The framework uses state-of-the-art evaluation metrics including Lingo-Judge to assess answer quality and truthfulness.

Features

  • Qwen2.5-VL Model Evaluation: Evaluate Qwen2.5-VL-3B-Instruct with optional LoRA fine-tuning adapters
  • Extensible Evaluator Framework: Base Evaluator class supporting multiple vision-language models
  • Multiple Evaluation Metrics: Built-in support for BLEU, METEOR, CIDEr, BERTScore, MQA, Prometheus, and Lingo-Judge
  • LingoQA Benchmark Support: Optimized evaluation for the LingoQA driving-related QA dataset
  • Flexible Model Loading: Singleton pattern for efficient model management with CUDA support (see the sketch after this list)
  • Batch Processing: Process multiple datasets and generate comprehensive evaluation reports
  • Configurable Parameters: Easy customization of model checkpoints, dataset paths, and output locations
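
The singleton behavior mentioned above can be pictured with a short sketch. The class name and attribute are illustrative assumptions, not the repository's actual code (that lives in drivefusion_evaluate/models/qwen2_5_vl_lora.py):

class ModelSingleton:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Reuse the existing instance so the multi-GB model is loaded
        # onto the GPU only once per process (illustrative sketch).
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance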

Installation

  1. Clone the repository:
git clone https://github.com/DriveFusion/Evaluate-Models.git
cd Evaluate-Models
  2. Install dependencies:
pip install -r requirements.txt

Usage

Basic Evaluation

from drivefusion_evaluate import evaluate_qwen2_5_vl, show_score

# Run evaluation with LoRA adapter
evaluate_qwen2_5_vl(
    datasets_path="/path/to/datasets",
    results_path="./results",
    images_path=["/path/to/images"],
    adaptor_path="/path/to/model/checkpoint"
)

# Or evaluate base model (without adapter)
evaluate_qwen2_5_vl(
    datasets_path="/path/to/datasets",
    results_path="./results",
    images_path=["/path/to/images"],
    adaptor_path=None
)

# Display evaluation results
show_score("./results/lingoqa_eval_lingo_judge_score.csv")

Advanced Configuration

from drivefusion_evaluate.models import Qwen2_5_VLLora

# Custom model initialization with specific parameters
model = Qwen2_5_VLLora(
    adapter_path="/path/to/checkpoint",
    model_name="Qwen/Qwen2.5-VL-3B-Instruct",  # Default
    device_map="cuda"
)

# Generate predictions on custom data
message = [{"role": "user", "content": [...]}]
prediction = model.generate_text_from_message(message, max_new_tokens=1024)
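
The content list above is elided; a filled-in message, assuming the standard Qwen2.5-VL chat schema (the image path and question are placeholders), could look like:

message = [
    {
        "role": "user",
        "content": [
            # One or more images, then the question text.
            {"type": "image", "image": "/path/to/images/frame_000123.jpg"},
            {"type": "text", "text": "Is it safe to change lanes now?"},
        ],
    }
]
prediction = model.generate_text_from_message(message, max_new_tokens=1024)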

Configuration

Main parameters in evaluate.py:

  • datasets_path: Path to evaluation datasets (JSON format with questions and answers; a hypothetical record is sketched after this list)
  • results_path: Directory to save evaluation results and scores
  • images_path: List of image directories corresponding to dataset images
  • adaptor_path: Path to LoRA adapter checkpoint or None for base model evaluation
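
A hypothetical dataset record, purely to illustrate the kind of fields involved (an assumption, not the repository's actual schema):

example_record = {
    "question": "Is it safe to change lanes now?",  # model input
    "answer": "No, there is a vehicle in the left blind spot.",  # reference
    "images": ["frame_000123.jpg"],  # resolved against images_path
}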

Model parameters in drivefusion_evaluate/models/qwen2_5_vl_lora.py:

  • model_name: HuggingFace model identifier (default: Qwen/Qwen2.5-VL-3B-Instruct)
  • device_map: Device for model loading (default: "cuda")
  • max_new_tokens: Maximum tokens to generate per prediction (default: 1024)

Project Structure

drivefusion_evaluate/
├── evaluate.py                      # Main evaluation entry point
├── constants.py                     # Project paths and constants
├── evaluators/
│   ├── __init__.py
│   ├── evaluator.py                # Abstract base evaluator class
│   └── qwen2_5_vl_evaluator.py     # Qwen2.5-VL implementation
├── models/
│   ├── __init__.py
│   └── qwen2_5_vl_lora.py          # Qwen2.5-VL with LoRA support (singleton)
└── evaluation_metrics/
    ├── __init__.py
    ├── lingo_judge.py              # Lingo-Judge metric
    ├── bleu.py                     # BLEU metric
    ├── meteor.py                   # METEOR metric
    ├── cider.py                    # CIDEr metric
    ├── bertscore.py                # BERTScore metric
    ├── mqa.py                      # MQA metric
    └── prometheus.py               # Prometheus metric
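
The evaluator abstraction in evaluators/evaluator.py could be pictured as follows; the method names are assumptions rather than the file's verbatim API:

from abc import ABC, abstractmethod

class Evaluator(ABC):
    """Base class that each vision-language model evaluator subclasses."""

    @abstractmethod
    def predict(self, question: str, image_paths: list[str]) -> str:
        """Return the model's answer for one question plus images."""

    @abstractmethod
    def evaluate(self, datasets_path: str, results_path: str) -> None:
        """Run the model over a dataset and write per-sample results."""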

Requirements

  • Python 3.10+
  • PyTorch with CUDA 11.8 support
  • Transformers library
  • Qwen VL utilities and processor
  • PEFT (Parameter-Efficient Fine-Tuning) for LoRA support
  • Additional evaluation dependencies (see requirements.txt)

Key Dependencies

  • torch & torchvision: Deep learning framework
  • transformers: Model loading and processing
  • qwen-vl-utils: Qwen-specific vision utilities
  • peft: LoRA adapter loading
  • scikit-learn: ML utilities for metrics
  • pandas: Result aggregation and reporting
  • tqdm: Progress bars

Evaluation Metrics

Available Metrics

The framework includes multiple evaluation metrics for comprehensive model assessment; a minimal sketch computing two of them follows the list:

  • Lingo-Judge: Specialized text classifier that scores the truthfulness and accuracy of answers on the LingoQA benchmark, with normalized scores in [0, 1]
  • BLEU: Bilingual Evaluation Understudy - measures n-gram overlap with reference answers
  • METEOR: Metric for Evaluation of Translation with Explicit ORdering - includes semantic similarity
  • CIDEr: Consensus-based Image Description Evaluation - optimized for visual QA tasks
  • BERTScore: Leverages BERT embeddings for semantic similarity evaluation
  • MQA: Multimodal Question Answering metric
  • Prometheus: Advanced LLM-based evaluation using language models for quality assessment
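
As a concrete illustration, two of the reference-based metrics above can be computed with the standalone nltk and bert-score packages; this is a minimal sketch, and the repository's own implementations in evaluation_metrics/ may differ:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from bert_score import score as bertscore

reference = "No, there is a vehicle in the left blind spot."
prediction = "No, a car is in your blind spot on the left."

# BLEU: n-gram overlap between prediction and reference tokens.
bleu = sentence_bleu(
    [reference.split()],
    prediction.split(),
    smoothing_function=SmoothingFunction().method1,
)

# BERTScore: semantic similarity from contextual embeddings.
precision, recall, f1 = bertscore([prediction], [reference], lang="en")
print(f"BLEU: {bleu:.3f}  BERTScore F1: {f1.item():.3f}")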

Result Aggregation

The show_score() function aggregates Lingo-Judge scores across all predictions and displays the average model performance as a percentage:

show_score("./results/lingoqa_eval_lingo_judge_score.csv")
# Output: model score: 85.42
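
A minimal sketch of that aggregation, assuming the CSV holds one per-sample Lingo-Judge score in a column named score (the actual column name may differ):

import pandas as pd

def show_score_sketch(csv_path: str) -> None:
    # Average the per-sample scores and report them as a percentage.
    scores = pd.read_csv(csv_path)["score"]
    print(f"model score: {scores.mean() * 100:.2f}")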

Support

For issues, questions, or contributions, please contact the DriveFusion team.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
