research: evaluate grepai vs Morph for semantic code search #875
Objective
Use AgentV as the benchmark harness to evaluate grepai vs Morph for semantic code search — demonstrating AgentV as an alternative to SWE-Bench for tool evaluation.
Context
Morph claims #1 on SWE-Bench Pro. Rather than relying on external benchmarks, we should design AgentV evals that measure code search tool effectiveness directly. This serves two purposes:
- Compare grepai vs Morph on dimensions that matter for agentic workflows
- Prove AgentV as a benchmark harness — if we can eval code search tools with AgentV, others can too
grepai (open-source, self-hosted)
- Go CLI, local vector embeddings + similarity search
- Swappable backends: embedders (Ollama, OpenAI, LM Studio) and vector stores (GOB, pgvector, Qdrant)
- Hybrid search: vector similarity + text matching via Reciprocal Rank Fusion (RRF)
- MCP server mode (`mcp-serve`) exposes search as native AI agent tools
- Multi-project workspace support with hierarchical config
- Research: `agentevals-research/research/findings/grepai/README.md`
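To make the hybrid-search bullet concrete, here is a minimal sketch of Reciprocal Rank Fusion, the general technique grepai's docs name for merging vector-similarity and text-match rankings. The function name, the sample file lists, and the constant are illustrative, not grepai's actual API.

```python
# Illustrative RRF sketch -- not grepai's implementation.
def rrf_fuse(rankings, k=60):
    """Merge ranked lists: each doc scores sum of 1/(k + rank) across lists.
    k=60 is the conventional constant from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the two backends for one query:
vector_hits = ["auth.go", "session.go", "token.go"]   # embedding similarity
text_hits = ["token.go", "auth.go", "config.go"]      # literal text match
fused = rrf_fuse([vector_hits, text_hits])
# auth.go and token.go appear in both lists, so they rise to the top
```

Rank-based fusion like this needs no score normalization between backends, which is why it is a common choice for combining heterogeneous retrievers.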
Morph (commercial API, YC-backed)
- WarpGrep: Dedicated search LLM, 8 parallel tool calls per turn, ~3.8 steps to results
- Fast Apply: Merges edit snippets at 10,500+ tok/s, 98% accuracy
- Compact: Compresses context 50-70% in <2s
- Claims #1 on SWE-Bench Pro, 15.8% cheaper and 22% faster
- Available as MCP server and liteLLM provider
Eval Design (AgentV as Harness)
Design AgentV eval cases that measure code search tools on real tasks:
- Retrieval accuracy: Given a natural language query + known-relevant files, does the tool return them? (precision/recall)
- End-to-end task completion: Agent with grepai MCP vs agent with Morph MCP — which leads to more correct solutions?
- Latency & cost: Measure wall-clock time and token/compute cost per search across eval runs
- Context efficiency: How much relevant context does each tool surface vs noise?
- Privacy tradeoff: Local-only (grepai) vs API-dependent (Morph) — eval with air-gapped constraints
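The retrieval-accuracy dimension above can be sketched as a simple scorer: given the files a search tool returned and a known-relevant set for the query, compute precision and recall. The function name and sample file lists are hypothetical, not part of AgentV.

```python
# Hypothetical retrieval-accuracy scorer for an eval case.
def retrieval_scores(returned, relevant):
    """Precision: fraction of returned files that are relevant.
    Recall: fraction of relevant files that were returned."""
    returned_set, relevant_set = set(returned), set(relevant)
    hits = returned_set & relevant_set
    precision = len(hits) / len(returned_set) if returned_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Example: tool returned 3 files, 2 of 3 known-relevant files among them
p, r = retrieval_scores(
    returned=["auth.go", "config.go", "token.go"],
    relevant=["auth.go", "token.go", "session.go"],
)
# precision = 2/3, recall = 2/3
```

Running the same scorer over both tools' results for a fixed query set would yield the per-tool precision/recall comparison the eval design calls for.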
Non-Goals
- Reproducing SWE-Bench itself inside AgentV
- Building code search into AgentV core
Acceptance Signals
- AgentV eval file(s) that benchmark code search tool effectiveness
- Comparison results: grepai vs Morph graded by AgentV
- Writeup on AgentV-as-harness viability for tool evaluation