Paper: Rethinking Information Synthesis in Multimodal Question Answering: A Multi-Agent Perspective 📚📊🔍
Authors: Tejas Anvekar, Krishna Singh Rajput, Chitta Baral, Vivek Gupta
Arizona State University
This repository contains the official implementation of MAMMQA, a prompt-driven, multi-agent system for multimodal question answering (MMQA).
The framework decomposes reasoning across three interpretable agents:
- Modality Expert Agents — extract insights from text, tables, or images.
- Cross-Modality Synthesis Agents — integrate these insights to form cross-modal reasoning chains.
- Aggregator Agent — synthesizes multiple agent outputs into a final, evidence-grounded answer.
Unlike traditional monolithic or fine-tuned MMQA models, MAMMQA is zero-shot, modular, and LLM-agnostic, and is compatible with OpenAI GPT-4o, Gemini 1.5-Flash, and Qwen2.5-VL models.
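For orientation, the three-stage flow can be sketched with a few chat-completion calls through the openai library. This is an illustrative sketch only, not the repository's implementation: the prompts and function names are hypothetical, and the image modality is omitted for brevity.

```python
# Illustrative three-stage MAMMQA-style flow (sketch only; prompts and names are hypothetical).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(system_prompt: str, user_prompt: str, model: str = "gpt-4o-mini") -> str:
    """One chat-completion call, shared by all agents."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

def answer_question(question: str, passage: str, table: str) -> str:
    # 1. Modality Expert Agents: extract per-modality insights.
    text_insight = ask("You are a text expert. Extract facts relevant to the question.",
                       f"Question: {question}\nPassage: {passage}")
    table_insight = ask("You are a table expert. Extract facts relevant to the question.",
                        f"Question: {question}\nTable: {table}")
    # 2. Cross-Modality Synthesis Agent: combine insights into a reasoning chain.
    chain = ask("Combine the modality insights into one cross-modal reasoning chain.",
                f"Question: {question}\nText insight: {text_insight}\nTable insight: {table_insight}")
    # 3. Aggregator Agent: produce the final, evidence-grounded answer.
    return ask("Answer concisely, grounded only in the provided reasoning chain.",
               f"Question: {question}\nReasoning chain: {chain}")
```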
Prerequisites
Ensure you have Python 3.8+ installed.

Dependencies
Install the required Python packages using pip:

pip install pandas openai tqdm python-dotenv

API Key Configuration
The agents rely on the openai library to interface with various Large Language Models (LLMs) such as GPT-4o-mini, Qwen, and Gemini.
Create a file named .env in the root directory of the repository.
Add your API key for the chosen provider (e.g., OpenAI or DashScope) to the file. The My Agents.py file reads environment variables such as DASHSCOPE_API_KEY or OPENAI_API_KEY.
Example .env content:
OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Or
# DASHSCOPE_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
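The keys are picked up at runtime through python-dotenv and the openai client. The sketch below shows the typical loading pattern; it is not the repository's exact code, and the DashScope base URL in the comment is an assumption based on DashScope's OpenAI-compatible endpoint.

```python
# Minimal sketch: load the API key from .env and create an OpenAI-compatible client.
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads .env from the repository root (current working directory)

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# For Qwen models served through DashScope, an OpenAI-compatible endpoint can be
# used instead (assumed URL):
# client = OpenAI(api_key=os.environ["DASHSCOPE_API_KEY"],
#                 base_url="https://dashscope.aliyuncs.com/compatible-mode/v1")
```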
Data Preparation
Our experiments use the MultiModalQA and ManyModalQA datasets. Create a root data directory, download both datasets, and structure the files to match the paths expected by Dataloader.py and run_mt_script.py.
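Based on the paths passed to run_mt_script.py in the commands below, the expected layout looks roughly as follows (adjust names to match your actual downloads):

```
data/
├── MultiModalQA/
│   ├── endgame_dev_filtered_data.json
│   ├── MMQA_tables.jsonl
│   ├── MMQA_texts.jsonl
│   ├── MMQA_images.jsonl
│   └── final_dataset_images/
└── ManyModalQA/
    ├── ManyModalQAData/
    │   └── official_aaai_split_dev_data.json
    └── ManyModalImages/
```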
The main execution script is run_mt_script.py. It uses multithreading for efficient evaluation and is configured via command-line arguments.

Running MultiModalQA
Run the evaluation on the MultiModalQA benchmark:
python "run_mt_script.py" \
--dataset_type "multimqa" \
--dev_file "./data/MultiModalQA/endgame_dev_filtered_data.json" \
--tables_file "./data/MultiModalQA/MMQA_tables.jsonl" \
--texts_file "./data/MultiModalQA/MMQA_texts.jsonl" \
--images_file "./data/MultiModalQA/MMQA_images.jsonl" \
--images_base_url "./data/MultiModalQA/final_dataset_images" \
--model "gpt-4o-mini" \
--results_csv "multimqa_results.csv" \
--num_iterations 100 \
--num_threads 16

Running ManyModalQA
Run the evaluation on the ManyModalQA benchmark:
python "run_mt_script.py" \
--dataset_type "manymqa" \
--dev_file "./data/ManyModalQA/ManyModalQAData/official_aaai_split_dev_data.json" \
--tables_file "./data/MultiModalQA/MMQA_tables.jsonl" \
--texts_file "./data/MultiModalQA/MMQA_texts.jsonl" \
--images_file "./data/MultiModalQA/MMQA_images.jsonl" \
--images_base_url "./data/ManyModalQA/ManyModalImages" \
--model "gpt-4o-mini" \
--results_csv "manymqa_results.csv" \
--num_iterations 100 \
--num_threads 16
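Each run writes per-question results to the file passed via --results_csv. Below is a quick way to inspect that file with pandas (already a dependency); the column layout depends on what run_mt_script.py logs, so nothing is assumed about it here.

```python
# Minimal sketch: inspect the results CSV written via --results_csv.
# The exact columns depend on run_mt_script.py's output, so only print what is there.
import pandas as pd

results = pd.read_csv("manymqa_results.csv")
print(results.shape)               # (questions evaluated, logged fields)
print(results.columns.tolist())    # discover the logged fields
print(results.head())
```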
Citation

@misc{rajput2025rethinkinginformationsynthesismultimodal,
  title={Rethinking Information Synthesis in Multimodal Question Answering: A Multi-Agent Perspective},
  author={Krishna Singh Rajput and Tejas Anvekar and Chitta Baral and Vivek Gupta},
  year={2025},
  eprint={2505.20816},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.20816},
}