# CodeAgent: LLM-based Agent Framework for Repository-level Code Generation

This notebook is a **thin orchestrator** for the CodeAgent framework, a replication of the paper:
[CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges](https://arxiv.org/abs/2401.07339)

## Overview

CodeAgent leverages external tools to assist LLMs in repository-level code generation tasks. Unlike simple function-level generation, repository-level code generation requires understanding:
- Documentation and README files
- Code dependencies and imports
- Runtime environment and testing infrastructure
- Existing code patterns and conventions

## Architecture

The framework consists of five core programming tools:
1. **FormatCheckTool**: Python code formatting using `black`
2. **CodeSymbolNavigationTool**: AST-based code navigation using tree-sitter
3. **CodeInterpreterTool**: Python code execution in isolated environments
4. **DocSearchTool**: BM25-based documentation search
5. **WebsiteSearchTool**: DuckDuckGo web search with summarization
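
To make the BM25 ranking behind documentation search concrete, here is a minimal pure-Python sketch of Okapi BM25. It is an illustration only: the whitespace tokenizer, the `k1` and `b` defaults, and the function names are assumptions, and the actual `DocSearchTool` may be implemented differently.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive lowercase/whitespace tokenizer (an assumption for this sketch).
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "attention layer computes scaled dot product attention",
    "tokenizer splits text into subword tokens",
    "optimizer updates model parameters",
]
scores = bm25_scores("attention layer", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Only the first document shares terms with the query, so it is the one ranked highest.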

## Project Structure

```
codeagent/
├── src/codeagent/
│   ├── config/       # Configuration and settings
│   ├── llm/          # LLM providers (HuggingFace, OpenAI, Gemini)
│   ├── tools/        # Programming tools
│   ├── benchmarks/   # Benchmark datasets
│   ├── agents/       # Agent factory and strategies
│   ├── evaluation/   # Evaluation pipeline
│   └── utils/        # Utility functions
├── tests/            # Test suite
└── CodeAgent_Final.ipynb  # This orchestrator notebook
```

## Phase 0: Setup & Configuration

Initialize the environment, set random seeds for reproducibility, and configure project paths.

This phase establishes a clean, consistent workspace. Fixed seeds make local randomness repeatable; responses from hosted LLM APIs can still vary between runs.
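
For orientation, a seed-fixing helper along the lines of `fix_random_seeds` might look like the sketch below. The body is an assumption, not the package's actual implementation: it seeds Python's `random` module and, if it happens to be installed, NumPy.

```python
import os
import random


def fix_random_seeds_sketch(seed: int) -> None:
    """Hypothetical stand-in for codeagent.fix_random_seeds."""
    # Affects only subprocesses started later; the running interpreter's
    # hash seed is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np  # optional dependency in this sketch
    except ImportError:
        return
    np.random.seed(seed)


fix_random_seeds_sketch(42)
first = [random.random() for _ in range(3)]
fix_random_seeds_sketch(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```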

In [None]:
# Add the source directory to the Python path
import sys
sys.path.insert(0, "./src")

# Core imports from the codeagent package
from pathlib import Path
from codeagent import (
    CodeAgentConfig,
    fix_random_seeds,
    create_llm,
    get_all_tools,
)
from codeagent.agents import create_agent_executor
from codeagent.benchmarks import MiniTransformersBench, CodeAgentBench
from codeagent.evaluation import (
    run_evaluation_pipeline,
    calculate_pass_rate,
    generate_report,
    print_summary,
)

# Configuration
config = CodeAgentConfig(
    project_repo_path=Path("./mini_transformers_repo"),
    random_seed=42,
)

# Fix random seeds for reproducibility
fix_random_seeds(config.random_seed)

print(f"Project repository path: {config.project_repo_path}")
print(f"Random seed: {config.random_seed}")
print("Setup complete.")

## Phase 1: LLM Configuration

Configure and load the LLM. The framework supports multiple providers:

| Provider | Description | Model Examples |
|----------|-------------|----------------|
| **gemini** | Google's Gemini API | gemini-2.5-flash, gemini-pro |
| **openai** | OpenAI's API | gpt-4, gpt-4-turbo, gpt-3.5-turbo |
| **deepseek** | DeepSeek via OpenRouter | deepseek/deepseek-chat |
| **huggingface** | Local models with quantization | codellama/CodeLlama-7b-hf |

### API Key Configuration

Set your API keys as environment variables:
```bash
export GOOGLE_API_KEY="your-gemini-key"
export OPENAI_API_KEY="your-openai-key"
export OPENROUTER_API_KEY="your-openrouter-key"
```
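
Before loading a provider, it can help to confirm that the matching variable is set. The sketch below is illustrative: the provider-to-variable mapping is inferred from the table and the exports above (DeepSeek is routed through OpenRouter), and `missing_api_key` is a hypothetical helper, not part of the `codeagent` package.

```python
import os
from typing import Mapping, Optional

# Assumed mapping, inferred from the provider table and the exports above.
PROVIDER_ENV_VARS = {
    "gemini": "GOOGLE_API_KEY",
    "openai": "OPENAI_API_KEY",
    "deepseek": "OPENROUTER_API_KEY",
}


def missing_api_key(provider: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the unset variable for `provider`, or None if nothing is missing."""
    env = os.environ if env is None else env
    # Providers without an entry (e.g. local "huggingface" models) need no key here.
    var = PROVIDER_ENV_VARS.get(provider)
    if var is None or env.get(var):
        return None
    return var


missing = missing_api_key("gemini")
print("ok" if missing is None else f"Please export {missing}")
```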

In [None]:
# Choose your LLM provider: "gemini", "openai", "deepseek", or "huggingface"
SELECTED_LLM = "gemini"

# Optional: Specify a model ID (defaults are sensible for each provider)
MODEL_ID = None  # e.g., "gemini-2.5-flash", "gpt-4", "deepseek/deepseek-chat"

# Load the LLM
llm, llm_ready = create_llm(
    provider=SELECTED_LLM,
    model_id=MODEL_ID,
)

if llm_ready:
    print(f"LLM '{SELECTED_LLM}' loaded successfully.")
else:
    print(f"Warning: LLM '{SELECTED_LLM}' failed to load. Check your API keys.")

## Phase 2: Benchmark Loading

Load the benchmark dataset. Two benchmarks are available:

### MiniTransformers Benchmark (Recommended for Development)
A smaller benchmark designed for iterative development and testing:
- **22 source files** implementing a minimal transformer architecture
- **15 tasks** covering additive, fix, and refactoring operations
- **~98 words** average instruction length
- Fast iteration cycles for debugging agent behavior

### CodeAgentBench (Full Evaluation)
The complete benchmark from the numpy-ml repository:
- **57 tasks** (51 class generation, 6 function generation)
- **~340 words** average instruction length
- Complex files with up to 9,000 lines
- Tests agent's ability to handle large codebases

In [None]:
# Choose your benchmark: MiniTransformersBench (smaller) or CodeAgentBench (full)
benchmark = MiniTransformersBench()

# Load the codebase and tasks
codebase_df = benchmark.load_codebase()
tasks_df = benchmark.load_tasks()

print(f"Benchmark: {benchmark.__class__.__name__}")
print(f"Codebase: {len(codebase_df)} files")
print(f"Tasks: {len(tasks_df)} tasks")
print(f"\nTask IDs: {tasks_df['task_id'].tolist()}")
print("\nTask types:")
print(tasks_df.groupby('title')['task_id'].count())

## Phase 3: Tools Setup

Initialize the five programming tools that enable the agent to interact with the codebase:

| Tool | Purpose | Key Technology |
|------|---------|----------------|
| **FormatCheck** | Validates Python code formatting | `black` formatter |
| **CodeSymbolNavigation** | Searches/navigates code symbols | tree-sitter AST |
| **CodeInterpreter** | Executes Python code safely | Isolated subprocess |
| **DocSearch** | Searches documentation | BM25 ranking |
| **WebsiteSearch** | Searches the web | DuckDuckGo API |

Each tool is designed to extend the LLM's capabilities beyond pure text generation.
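
As an illustration of the "isolated subprocess" idea, the sketch below runs a snippet in a fresh interpreter inside a throwaway directory, with a timeout. `run_python_snippet` is a hypothetical helper written for this example; the real `CodeInterpreter` tool may work differently. A separate process limits interference with the notebook kernel, but it is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_python_snippet(code: str, timeout: float = 10.0) -> tuple[int, str, str]:
    """Run `code` in a fresh interpreter and return (exit code, stdout, stderr)."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "snippet.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                # -I: isolated mode (ignores PYTHON* variables and the user site directory)
                [sys.executable, "-I", str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # -1 is this sketch's own convention for "killed by timeout".
            return -1, "", f"timed out after {timeout} s"
    return proc.returncode, proc.stdout, proc.stderr


exit_code, out, err = run_python_snippet("print(sum(range(10)))")
print(exit_code, out.strip())  # 0 45
```

Returning stderr alongside the exit code lets an agent read the traceback and revise its code on the next step.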

In [None]:
# Create all tools configured for the project
tools = get_all_tools(config.project_repo_path)

print(f"Initialized {len(tools)} tools:")
for tool in tools:
    print(f"\n  {tool.name}:")
    print(f"    {tool.description[:80]}...")

## Phase 4: Agent Creation

Create the agent executor with the chosen strategy. The agent orchestrates the LLM and tools.

### Agent Strategies

| Strategy | Description | Best For |
|----------|-------------|----------|
| **react** | ReAct with explicit Thought/Action/Observation | Debugging, understanding agent reasoning |
| **tool_calling** | Native tool calling (OpenAI/Gemini) | Production use, faster execution |

The ReAct strategy is recommended for development as it provides clear visibility into the agent's decision-making process.
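
The Thought/Action/Observation cycle can be sketched in a few lines. Everything here is an assumption made for illustration: the reply format, the regular expression, and the scripted stand-in for the LLM. `create_agent_executor` may build its loop quite differently.

```python
import re
from typing import Callable

# Assumed reply format: "Action: <tool>" followed by "Action Input: <text>".
ACTION_RE = re.compile(r"Action:\s*(\w+)\s*\nAction Input:\s*(.*)", re.DOTALL)


def react_loop(
    llm: Callable[[str], str],
    tools: dict[str, Callable[[str], str]],
    task: str,
    max_iterations: int = 25,
) -> str:
    """Alternate LLM steps and tool calls until the LLM emits a final answer."""
    transcript = f"Task: {task}\n"
    for _ in range(max_iterations):
        reply = llm(transcript)
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(reply)
        if match is None:
            observation = "Could not parse an action."
        else:
            name, tool_input = match.group(1), match.group(2).strip()
            tool = tools.get(name)
            observation = tool(tool_input) if tool else f"Unknown tool: {name}"
        # The observation is appended so the next LLM step can react to it.
        transcript += f"Observation: {observation}\n"
    return "Stopped: iteration limit reached."


# Scripted stand-in for a real LLM: call a tool first, then answer.
replies = iter([
    "Thought: I should check the docs.\nAction: doc_search\nAction Input: attention mask",
    "Thought: I have what I need.\nFinal Answer: use `attention_mask`",
])
answer = react_loop(
    llm=lambda prompt: next(replies),
    tools={"doc_search": lambda q: f"docs mention '{q}'"},
    task="Find the mask argument name",
)
print(answer)  # use `attention_mask`
```

Because every step is written into the transcript, a failed run can be read back line by line, which is what makes this style convenient for debugging.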

In [None]:
# Choose agent strategy: "react" or "tool_calling"
AGENT_STRATEGY = "react"

# Iteration cap, defined once so the printed value always matches what is passed
MAX_ITERATIONS = 25

# Create the agent executor
if llm_ready:
    agent_executor = create_agent_executor(
        llm=llm,
        tools=tools,
        strategy=AGENT_STRATEGY,
        project_repo_path=config.project_repo_path,
        max_iterations=MAX_ITERATIONS,
        verbose=True,
    )
    print(f"Agent created with '{AGENT_STRATEGY}' strategy.")
    print(f"Max iterations: {MAX_ITERATIONS}")
else:
    agent_executor = None
    print("Skipping agent creation: LLM not ready.")

## Phase 5: Evaluation Pipeline

Run the evaluation pipeline on the benchmark tasks. For each task, the pipeline:

1. **Repository Setup**: Reconstructs the codebase for the task
2. **Agent Execution**: Runs the agent with the task prompt
3. **Verification**: Runs pytest to verify the generated code
4. **Result Collection**: Records pass/fail and logs

### Configuration Options
- `task_ids`: Run only the listed task IDs (default: all tasks)
- `start_from_task`: Skip ahead to this task ID, for example to resume an interrupted run
- `delay_between_tasks`: Pause between tasks to stay within API rate limits
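
One plausible way the two selection options could combine is sketched below. `select_tasks` is a hypothetical helper, and its semantics (filter first, then drop everything before the start task) and the task ID pattern are assumptions; `run_evaluation_pipeline` may behave differently.

```python
from typing import Optional, Sequence


def select_tasks(
    all_task_ids: Sequence[str],
    task_ids: Optional[Sequence[str]] = None,
    start_from_task: Optional[str] = None,
) -> list[str]:
    """Filter to `task_ids` (if given), then drop everything before `start_from_task`."""
    wanted = None if task_ids is None else set(task_ids)
    # Preserve the benchmark's original task order.
    selected = [t for t in all_task_ids if wanted is None or t in wanted]
    if start_from_task is not None:
        if start_from_task not in selected:
            raise ValueError(f"Unknown start task: {start_from_task}")
        selected = selected[selected.index(start_from_task):]
    return selected


all_ids = [f"miniformer-{i:02d}" for i in range(1, 6)]
print(select_tasks(all_ids, start_from_task="miniformer-04"))
# ['miniformer-04', 'miniformer-05']
```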

In [None]:
# Optional: Specify which tasks to run (None = all tasks)
TASK_IDS_TO_RUN = None  # e.g., ["miniformer-01", "miniformer-02"]

# Optional: Resume from a specific task
START_FROM_TASK = None  # e.g., "miniformer-05"

# Run the evaluation pipeline
if agent_executor is not None:
    results = run_evaluation_pipeline(
        agent_executor=agent_executor,
        codebase_df=codebase_df,
        task_df=tasks_df,
        project_repo_path=config.project_repo_path,
        task_ids=TASK_IDS_TO_RUN,
        start_from_task=START_FROM_TASK,
        test_timeout=60,
        delay_between_tasks=2.0,
        print_results=True,
        llm_name=SELECTED_LLM,
    )
else:
    results = []
    print("Skipping evaluation: Agent not configured.")

## Phase 6: Results Analysis

Analyze the evaluation results, generate reports, and examine failed tasks.

Key metrics:
- **Pass@1 Rate**: Percentage of tasks solved on first attempt
- **Per-task breakdown**: Individual task outcomes
- **Failure analysis**: Verification logs for debugging
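
With one attempt per task, Pass@1 reduces to the fraction of tasks that passed verification. The sketch below shows that arithmetic; `TaskResult` and `pass_rate` are hypothetical names for this example, and the real `calculate_pass_rate` may differ in its details.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical result record; only the fields used below are modeled."""
    task_id: str
    success: bool


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose single attempt passed verification."""
    if not results:
        return 0.0  # avoid dividing by zero on an empty run
    return sum(r.success for r in results) / len(results)


demo_results = [
    TaskResult("task-a", True),
    TaskResult("task-b", False),
    TaskResult("task-c", True),
    TaskResult("task-d", True),
]
print(f"{pass_rate(demo_results):.2%}")  # 75.00%
```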

In [None]:
if results:
    # Generate summary report
    report_df = generate_report(results)
    display(report_df)
    
    # Calculate pass rate
    pass_rate = calculate_pass_rate(results)
    print(f"\nOverall Pass@1 Rate: {pass_rate:.2%}")
    print(f"Passed: {sum(1 for r in results if r.success)}/{len(results)} tasks")
    
    # Show failed tasks
    failed = [r for r in results if not r.success]
    if failed:
        print(f"\n--- Failed Tasks ({len(failed)}) ---")
        for r in failed:
            print(f"  - {r.task_id}: {r.title}")
else:
    print("No results to analyze.")

## Phase 7: No-Agent Baseline (Optional)

Run a baseline evaluation **without** the agent loop to measure the improvement from tool usage.

The baseline:
- Provides the LLM with all context (task description + existing code)
- Asks for a single-shot code generation
- No tool access, no iteration

This demonstrates the value of the agentic approach with tools.
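
A single-shot prompt of this kind could be assembled roughly as follows. This is a sketch under stated assumptions: the prompt wording, the section markers, and `build_single_shot_prompt` itself are invented for illustration and are not taken from `run_no_agent_baseline`.

```python
def build_single_shot_prompt(instruction: str, files: dict[str, str]) -> str:
    """Concatenate every source file and the task instruction into one prompt."""
    parts = ["You are given a repository and a task. Reply with the complete code only.", ""]
    # Sort paths so the prompt is identical across runs.
    for path, source in sorted(files.items()):
        parts.append(f"### File: {path}")
        parts.append(source.rstrip())
        parts.append("")
    parts.append("### Task")
    parts.append(instruction.strip())
    return "\n".join(parts)


prompt = build_single_shot_prompt(
    "Add a `scale` argument to `attention`.",
    {"model/attention.py": "def attention(q, k, v):\n    ...\n"},
)
print(prompt)
```

The model sees this prompt once and gets no chance to run code or look anything up, which is the contrast the baseline is meant to expose.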

In [None]:
from codeagent.evaluation import run_no_agent_baseline, compare_results

RUN_BASELINE = False  # Set to True to run baseline comparison

if RUN_BASELINE and llm_ready:
    print("Running no-agent baseline evaluation...")
    baseline_results = run_no_agent_baseline(
        llm_instance=llm,
        codebase_df=codebase_df,
        task_df=tasks_df,
        project_repo_path=config.project_repo_path,
        task_ids=TASK_IDS_TO_RUN,
        delay_between_tasks=2.0,
    )
    
    # Compare agent vs baseline
    if results:
        comparison = compare_results(
            agent_results=results,
            baseline_results=baseline_results,
            agent_name=f"{SELECTED_LLM}_Agent",
            baseline_name=f"{SELECTED_LLM}_NoAgent",
        )
        display(comparison)
elif RUN_BASELINE:
    print("Baseline comparison skipped: LLM not ready.")
else:
    print("Baseline comparison skipped. Set RUN_BASELINE=True to enable.")

## Phase 8: HumanEval Evaluation (Optional)

Run the [HumanEval benchmark](https://github.com/openai/human-eval) for function-level code generation.

HumanEval contains 164 Python programming problems testing:
- Algorithm implementation
- String manipulation
- Data structure operations
- Mathematical functions

This provides a complementary evaluation to repository-level tasks.
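
HumanEval results are usually reported as pass@k, using the unbiased estimator 1 - C(n-c, k) / C(n, k) for a problem with n samples of which c are correct. The sketch below implements that formula. Whether `run_full_humaneval_pipeline` computes its `pass@1` value this way is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# (n, c) per problem; with one sample per problem, pass@1 is the solved fraction.
per_problem = [(1, 1), (1, 0), (1, 1), (1, 1)]
score = sum(pass_at_k(n, c, k=1) for n, c in per_problem) / len(per_problem)
print(score)  # 0.75
```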

In [None]:
from codeagent.evaluation import run_full_humaneval_pipeline

RUN_HUMANEVAL = False  # Set to True to run HumanEval
HUMANEVAL_SAMPLES = 15  # Number of problems to run (None = all 164)

if RUN_HUMANEVAL and agent_executor is not None:
    print(f"Running HumanEval evaluation ({HUMANEVAL_SAMPLES or 164} problems)...")
    humaneval_results = run_full_humaneval_pipeline(
        agent_executor=agent_executor,
        output_dir=Path("./evaluation_results/HumanEval"),
        num_samples=HUMANEVAL_SAMPLES,
        llm_name=SELECTED_LLM,
        delay_between_tasks=5.0,
    )
    
    print(f"\nHumanEval Pass@1: {humaneval_results['results'].get('pass@1', 0):.2%}")
elif RUN_HUMANEVAL:
    print("HumanEval evaluation skipped: agent not configured.")
else:
    print("HumanEval evaluation skipped. Set RUN_HUMANEVAL=True to enable.")

---

## Summary

This notebook demonstrated the complete CodeAgent workflow:

| Phase | Component | Purpose |
|-------|-----------|----------|
| 0 | Setup | Configuration, seeds, paths |
| 1 | LLM | Load language model |
| 2 | Benchmark | Load evaluation tasks |
| 3 | Tools | Initialize programming tools |
| 4 | Agent | Create agent executor |
| 5 | Evaluation | Run evaluation pipeline |
| 6 | Analysis | Examine results and metrics |
| 7 | Baseline | Compare with single-shot, no-agent generation |
| 8 | HumanEval | Function-level evaluation |

### Key Findings from the Paper

1. **Tools Matter**: The agent with tools significantly outperforms the base LLM
2. **FormatCheck Impact**: Code formatting validation catches many errors early
3. **CodeInterpreter Value**: Runtime testing enables iterative debugging
4. **Context is Key**: Repository-level context enables accurate generation

### Next Steps

- Experiment with different LLM providers
- Run the full CodeAgentBench for comprehensive evaluation
- Analyze tool usage patterns to understand agent behavior
- Compare performance across model sizes