title: SWE-Model-Arena
emoji: 🎯
colorFrom: green
colorTo: red
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
hf_oauth: true
pinned: false
short_description: Chatbot arena for software engineering tasks

SWE-Model-Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering

Welcome to SWE-Model-Arena, an open-source platform for evaluating software-engineering-focused foundation models (FMs), particularly large language models (LLMs). SWE-Model-Arena benchmarks models in the iterative, context-rich workflows that are characteristic of software engineering (SE) tasks.

Key Features

  • Multi-Round Conversational Workflows: Evaluate models through extended, context-dependent interactions that mirror real-world SE processes.
  • RepoChat Integration: Automatically inject repository context (issues, commits, PRs) into conversations for more realistic evaluations.
  • Advanced Evaluation Metrics: Assess models using a comprehensive suite of metrics (see the computation sketch after this list), including:
    • Traditional ranking metrics: Elo ratings and win rates to measure overall model performance
    • Network-based metrics: Eigenvector centrality and PageRank to identify influential models in head-to-head comparisons
    • Community detection metrics: Newman modularity to reveal clusters of models with similar capabilities
    • Consistency metrics: Self-play match analysis to quantify model determinism and reliability
    • Efficiency metrics: Conversation efficiency index to measure response quality relative to length
  • Transparent, Open-Source Leaderboard: View real-time model rankings across diverse SE workflows with full transparency.
  • Intelligent Request Filtering: Employ gpt-oss-safeguard-20b as a guardrail to automatically filter out non-software-engineering-related requests, ensuring focused and relevant evaluations.
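
As a rough illustration of the first two metric families above, the sketch below derives Elo ratings and PageRank scores from a list of pairwise vote records. The vote data, constants, and function names are hypothetical; this is not the platform's actual scoring code.

```python
# Illustrative sketch: deriving Elo ratings and PageRank scores from pairwise
# vote records. Data and constants are hypothetical, not SWE-Model-Arena's
# actual implementation.
from collections import defaultdict

import networkx as nx  # assumed available for the PageRank example

# Each record is one head-to-head vote: (winner, loser).
votes = [
    ("model_a", "model_b"),
    ("model_b", "model_c"),
    ("model_a", "model_c"),
]

def elo_ratings(records, k=32, base=1000.0):
    """Sequential Elo update over a list of (winner, loser) votes."""
    ratings = defaultdict(lambda: base)
    for winner, loser in records:
        # Expected score of the winner before the match.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected_win)
        ratings[loser] -= k * (1.0 - expected_win)
    return dict(ratings)

def pagerank_scores(records):
    """PageRank on a loser -> winner graph, so rank flows toward stronger models."""
    graph = nx.DiGraph()
    for winner, loser in records:
        graph.add_edge(loser, winner)
    return nx.pagerank(graph)

print(elo_ratings(votes))
print(pagerank_scores(votes))
```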

Why SWE-Model-Arena?

Existing evaluation frameworks (e.g. LMArena) often don't address the complex, iterative nature of SE tasks. SWE-Model-Arena fills critical gaps by:

  • Supporting context-rich, multi-turn evaluations to capture iterative workflows
  • Integrating repository-level context through RepoChat to simulate real-world development scenarios (see the sketch after this list)
  • Providing multidimensional metrics for nuanced model comparisons
  • Focusing on the full breadth of SE tasks beyond just code generation
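
To make the RepoChat idea concrete, here is a minimal sketch of repository-context injection, assuming the public GitHub REST API as the source. The prompt layout and function names are illustrative, not the platform's actual RepoChat implementation.

```python
# Illustrative sketch of RepoChat-style context injection: fetch recent issues
# and commits for a repository and prepend them to the user's task prompt.
# Prompt format and helper names are assumptions; unauthenticated GitHub API
# requests are also rate-limited.
import requests

def fetch_repo_context(owner, repo, limit=3):
    base = f"https://api.github.com/repos/{owner}/{repo}"
    issues = requests.get(f"{base}/issues", params={"per_page": limit}).json()
    commits = requests.get(f"{base}/commits", params={"per_page": limit}).json()
    issue_lines = [f"- Issue #{i['number']}: {i['title']}" for i in issues]
    commit_lines = [
        f"- {c['sha'][:7]}: {c['commit']['message'].splitlines()[0]}" for c in commits
    ]
    return ("Recent issues:\n" + "\n".join(issue_lines)
            + "\n\nRecent commits:\n" + "\n".join(commit_lines))

def build_prompt(user_task, owner, repo):
    """Prepend repository context to the user's SE task before sending it to the models."""
    return f"Repository context:\n{fetch_repo_context(owner, repo)}\n\nTask:\n{user_task}"

print(build_prompt("Fix the flaky CI test", "Software-Engineering-Arena", "SWE-Model-Arena"))
```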

How It Works

  1. Submit a Prompt: Sign in and input your SE-related task (optional: include a repository URL for RepoChat context)
  2. Compare Responses: Two anonymous models provide responses to your query
  3. Continue the Conversation: Test contextual understanding over multiple rounds
  4. Vote: Choose the better model at any point, with the ability to reassess after multiple turns (a sketch of this flow follows)
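
The following is a minimal sketch of the pairwise, multi-round flow described above: each user turn goes to two anonymous models, each model keeps its own conversation history, and a vote can be recorded after any round. `query_model` and the data shapes are hypothetical stand-ins for the real model backends.

```python
# Illustrative sketch of the pairwise, multi-round flow. `query_model` is a
# hypothetical placeholder for the real model backends.
def query_model(model_id, history):
    # A real implementation would call the model's API with the full history.
    return f"{model_id} reply to: {history[-1]['content']}"

def run_round(prompt, histories):
    """Send one user turn to both anonymous models and record their replies."""
    for model_id, history in histories.items():
        history.append({"role": "user", "content": prompt})
        history.append({"role": "assistant", "content": query_model(model_id, history)})
    return histories

histories = {"model_a": [], "model_b": []}
run_round("Refactor this function for readability", histories)
run_round("Now add unit tests for the edge cases", histories)  # second round keeps context

vote = {"winner": "model_a", "rounds": 2}  # the user may vote after any number of rounds
print(vote)
```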

Getting Started

Prerequisites

Usage

  1. Navigate to the SWE-Model-Arena platform
  2. Sign in with your Hugging Face account
  3. Enter your SE task prompt (optionally include a repository URL for RepoChat)
  4. Engage in multi-round interactions and vote on model performance

Contributing

We welcome contributions from the community! Here's how you can help:

  1. Submit SE Tasks: Share your real-world SE problems to enrich our evaluation dataset
  2. Report Issues: Found a bug or have a feature request? Open an issue in this repository
  3. Enhance the Codebase: Fork the repository, make your changes, and submit a pull request

Privacy Policy

Your interactions are anonymized and used solely for improving SWE-Model-Arena and FM benchmarking. By using SWE-Model-Arena, you agree to our Terms of Service.

Future Plans

  • Analysis of Real-World SE Workloads: Identify common patterns and challenges in user-submitted tasks
  • Multi-Round Evaluation Metrics: Develop specialized metrics for assessing model adaptation over successive turns
  • Expanded FM Coverage: Include multimodal and domain-specific foundation models
  • Advanced Context Compression: Integrate techniques like LongRoPE and SelfExtend to manage long-term memory in multi-round conversations

Contact

For inquiries or feedback, please open an issue in this repository. We welcome your contributions and suggestions!

Citation

Made with ❤️ for SWE-Model-Arena. If this work is useful to you, please consider citing our vision paper:

@inproceedings{zhao2025se,
  title={SE Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering},
  author={Zhao, Zhimin},
  booktitle={2025 IEEE/ACM Second International Conference on AI Foundation Models and Software Engineering (Forge)},
  pages={78--81},
  year={2025},
  organization={IEEE}
}
