AI Agent that handles engineering tasks end-to-end: integrates with developers’ tools, plans, executes, and iterates until it achieves a successful result.
Updated Jul 11, 2025 · Rust
This project explores how Large Language Models (LLMs) perform on real-world software engineering tasks, inspired by the SWE-Bench benchmark. Using locally hosted models like Llama 3 via Ollama, the tool evaluates code repair capabilities on Python repositories through custom test cases and a lightweight scoring framework.
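The description above mentions a "lightweight scoring framework" built on custom test cases, but the page does not show how scoring works. A minimal sketch of one plausible approach: run the project's custom tests after the model proposes a repair, then score the repair as the fraction of tests that pass. All names here (`TestResult`, `score_repair`) are hypothetical illustrations, not the project's actual API.

```python
from dataclasses import dataclass


@dataclass
class TestResult:
    """Outcome of one custom test run against a model-repaired repository."""
    name: str
    passed: bool


def score_repair(results: list[TestResult]) -> float:
    """Score a repair as the fraction of custom tests that pass (0.0-1.0)."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


# Hypothetical run: three of four custom tests pass after the LLM's patch.
results = [
    TestResult("test_import", True),
    TestResult("test_fix_applied", True),
    TestResult("test_edge_case", False),
    TestResult("test_regression", True),
]
print(score_repair(results))  # 0.75
```

In a real harness, `TestResult` entries would come from executing the repository's test suite (e.g. via `pytest`) inside a sandbox after applying the patch generated by the locally hosted model.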