NexusAD Logo

🚗 NexusAD

Exploring the Nexus for Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving

⚠️ Note: The code is currently being updated; stay tuned for more features and improvements.

ECCV 2024 Autonomous Driving Workshop · W-CODA 2024 Challenge Track 1 · Corner Case Scene Understanding Leaderboard




✍️ Authors

  • Mengjingcheng Mo, Jingxin Wang, Like Wang, Haosheng Chen, Changjun Gu, Jiaxu Leng, Xinbo Gao
    Chongqing University of Posts and Telecommunications

🌟 Project Highlights

  • 🔥 NexusAD introduces a multimodal perception and understanding framework based on InternVL-2.0, significantly improving detection, depth estimation, and reasoning abilities for complex scenarios through fine-tuning on the CODA-LM dataset.
  • 🏁 NexusAD competed in the ECCV 2024 Autonomous Driving Workshop (W-CODA 2024 Challenge Track 1), which focuses on multimodal scene understanding in extreme driving scenarios.

NexusAD Architecture


📰 Latest News

  • 2024/08/15: NexusAD was submitted to the ECCV 2024 Autonomous Driving Workshop challenge, achieving a final score of 68.97.
  • 2024/08/15: The NexusAD team released the latest version of the code and LoRA weights.

🚀 Quick Start

Follow these steps to start using NexusAD:

  1. Clone the repository:

    git clone https://github.com/OpenVisualLab/NexusAD.git
    cd NexusAD
  2. Install dependencies:

    pip install -r requirements.txt
  3. Download the CODA-LM Dataset and place it in the specified directory.

  4. Download the LoRA Weights and place them in the weights/ directory.

  5. Preprocess the data, fine-tune the model, and run evaluation (a quick path check follows this list):

    python preprocess.py --data_path <path-to-CODA-LM>
    python train.py --config config.json
    python evaluate.py --data_path <path-to-evaluation-set>
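
The README does not pin down an exact directory layout for the dataset or the weights, so a quick check like the one below can confirm that everything is where you expect before step 5. The paths are placeholders, not names mandated by the scripts:

    from pathlib import Path

    # Placeholder locations; adjust to wherever you actually put the data and weights.
    data_path = Path("data/CODA-LM")
    weights_dir = Path("weights")

    for p in (data_path, weights_dir):
        print(f"{p}: {'found' if p.exists() else 'MISSING'}")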

⚙️ Model Architecture

The NexusAD model architecture consists of the following components:

  1. Preliminary Visual Perception: Uses Grounding DINO for object detection and DepthAnything v2 for depth estimation, converting the resulting spatial information into structured text the language model can consume (a sketch of this step follows the list).

  2. Scene-aware Enhanced Retrieval Generation: Uses Retrieval-Augmented Generation (RAG) to retrieve and select relevant samples as in-context examples, improving understanding of complex driving scenarios (a retrieval-and-prompting sketch follows the list).

  3. Driving Prompt Optimization: Uses Chain-of-Thought (CoT) prompting to generate context-aware, structured driving suggestions.

  4. Fine-tuning: Parameter-efficient fine-tuning with LoRA optimizes performance while saving computational resources (a generic LoRA sketch follows the list).
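
To make component 1 concrete, here is a minimal sketch of how detector boxes and a depth map could be turned into structured text. The detection format, the distance estimate, and the wording are assumptions for illustration only; this is not the repository's actual preprocessing code.

    import numpy as np

    def detections_to_text(detections, depth_map):
        """Format detector boxes plus a dense depth map as structured scene text.

        `detections` is assumed to be a list of (label, (x1, y1, x2, y2)) tuples in
        pixel coordinates, and `depth_map` an HxW array of metric depth in metres.
        """
        w = depth_map.shape[1]
        lines = []
        for label, (x1, y1, x2, y2) in detections:
            # Median depth inside the box is a cheap, outlier-robust distance estimate.
            patch = depth_map[int(y1):int(y2), int(x1):int(x2)]
            distance = float(np.median(patch)) if patch.size else float("nan")
            # Coarse horizontal position from the box centre.
            cx = (x1 + x2) / 2
            side = "left" if cx < w / 3 else ("right" if cx > 2 * w / 3 else "ahead")
            lines.append(f"- {label}: about {distance:.1f} m away, {side} of the ego vehicle")
        return "\n".join(lines)

    # Toy example (values are made up):
    depth = np.full((480, 640), 12.0)
    dets = [("pedestrian", (40, 200, 120, 400)), ("traffic cone", (500, 300, 560, 380))]
    print(detections_to_text(dets, depth))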

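Components 2 and 3 can be pictured as a nearest-neighbour lookup over embeddings of previously annotated scenes, followed by assembling the retrieved examples into a step-by-step prompt. The sketch below assumes precomputed embeddings and uses a made-up prompt template; the actual encoder, corpus format, and prompts used by NexusAD are not described in this README.

    import numpy as np

    def retrieve_similar(query_emb, corpus_embs, corpus_texts, k=3):
        """Return the k corpus entries most similar to the query (cosine similarity)."""
        q = query_emb / np.linalg.norm(query_emb)
        c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
        top = np.argsort(-(c @ q))[:k]
        return [corpus_texts[i] for i in top]

    def build_prompt(scene_text, examples):
        """Assemble a simple few-shot, step-by-step prompt from retrieved examples."""
        shots = "\n\n".join(f"Example scene:\n{e}" for e in examples)
        return (
            f"{shots}\n\n"
            f"Current scene:\n{scene_text}\n\n"
            "Describe the hazards and give a driving suggestion, reasoning step by step."
        )

    # Toy usage with random embeddings (a real system would use a learned encoder):
    rng = np.random.default_rng(0)
    corpus_embs = rng.normal(size=(100, 256))
    corpus_texts = [f"annotated scene #{i}" for i in range(100)]
    examples = retrieve_similar(rng.normal(size=256), corpus_embs, corpus_texts)
    print(build_prompt("- pedestrian: about 8.0 m away, left of the ego vehicle", examples))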

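For component 4, the snippet below shows what a generic LoRA setup with the Hugging Face peft library looks like. The checkpoint name, rank, and target module names are placeholders, not the values NexusAD actually uses.

    # Generic LoRA setup with Hugging Face peft; all values are placeholders,
    # not the configuration used by NexusAD.
    from transformers import AutoModel
    from peft import LoraConfig, get_peft_model

    base = AutoModel.from_pretrained(
        "OpenGVLab/InternVL2-26B",   # placeholder checkpoint name
        trust_remote_code=True,
    )

    lora_cfg = LoraConfig(
        r=16,                                  # low-rank dimension (assumed)
        lora_alpha=32,                         # scaling factor (assumed)
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # projections to adapt (assumed)
    )

    model = get_peft_model(base, lora_cfg)
    model.print_trainable_parameters()  # only the LoRA adapters remain trainable
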
📊 Experimental Results

In the ECCV 2024 W-CODA Corner Case Scene Understanding track, NexusAD outperformed the baseline models with a final score of 68.97:

| Model | General Perception | Regional Perception | Driving Suggestions | Final Score |
|-------|--------------------|---------------------|---------------------|-------------|
| GPT-4V | 57.50 | 56.26 | 63.30 | 59.02 |
| CODA-VLM | 55.04 | 77.68 | 58.14 | 63.62 |
| InternVL-2.0-26B | 43.39 | 64.91 | 48.04 | 52.11 |
| NexusAD (Ours) | 57.58 | 84.31 | 65.02 | 68.97 |

💡 Contribution Guidelines

We welcome all forms of contributions! Please refer to CONTRIBUTING.md for details on how to participate.


📜 License & Citation

This project is licensed under the MIT License. If you find this project helpful in your research, please cite it as follows:

@inproceedings{mo2024nexusad,
  title={NexusAD: Multimodal Perception and Comprehension of Corner Cases in Autonomous Driving},
  author={Mo, Mengjingcheng and Wang, Jingxin and Wang, Like and Chen, Haosheng and Gu, Changjun and Leng, Jiaxu and Gao, Xinbo},
  booktitle={ECCV 2024 Autonomous Driving Workshop},
  year={2024}
}

🙏 Acknowledgments

Special thanks to the following projects for providing key references and support for the development of NexusAD:

  • InternVL: provides the base multimodal vision-language model on which NexusAD is built.
  • CODA-LM: provides the dataset and benchmark for the corner case understanding task.

(Back to top)
