# Cline Prompt Learning Optimization on SWE-bench - Act Mode

<p align="left">
  <span style="display: inline-block; width: 500px; height: auto; overflow: hidden; vertical-align: middle;">
    <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/phoenix.jpeg" style="margin: -10px -50px; width: 600px;" />
  </span>
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/cline.png" width="200" style="vertical-align: middle;" />
</p>

This notebook demonstrates how we used Prompt Learning to optimize Cline's performance on the SWE-bench dataset in **Act Mode**. Cline is a popular open-source coding agent. We aim to improve its performance on SWE-bench by optimizing its **rules**: user-specified instructions that Cline appends to its system prompt.

[More on Cline](https://cline.bot)

[More on Prompt Learning](https://arize.com/blog/prompt-learning-using-english-feedback-to-optimize-llm-systems/)

## Act Mode - Real Code Execution

Unlike Plan Mode, this notebook runs Cline in **Act Mode**, where Cline actually edits the codebase and generates patches. We then run the SWE-bench tests against those patches to measure how often Cline made the correct edits, which gives a ground-truth evaluation of its performance.

In Act Mode, Cline:
1. Analyzes the problem statement
2. Explores the codebase
3. Makes actual code edits
4. Generates patches
5. Has its patches validated against the SWE-bench test suite

## SWE-bench + Cline Setup

**Please visit README.md and complete all the Setup before running this notebook!**

## Phoenix

We use Phoenix, an open-source library for LLM development. Specifically, we use its experiments feature to track Cline's improvement as we optimize its ruleset.

Visit phoenix.arize.com and sign in or create an account.

## Important Note

Running this notebook is computationally intensive and expensive as it involves:
- Multiple API calls to Claude for each SWE-bench instance
- Actually cloning repositories and running tests in isolated environments
- Running SWE-bench harness to validate patches

Consider adjusting the training and test set sizes based on your requirements, budget constraints, and computational resources.


In [None]:
%pip install -qq swebench arize-phoenix arize-phoenix-client pandas

In [None]:
from run_act import run_act
from swebench.harness.utils import load_swebench_dataset
import random
from evals import evaluate_results
import os
import sys
import subprocess
from pathlib import Path
import pandas as pd
import getpass

notebook_dir = Path().absolute()
sys.path.insert(0, str(notebook_dir.parent.parent))
from optimizer_sdk.prompt_learning_optimizer import PromptLearningOptimizer

sys.path.insert(0, str(notebook_dir.parent))
from constants import CLINE_PROMPT



## API Keys

Set up your API keys for OpenAI and Phoenix, plus your Phoenix hostname. Any value not already in your environment is requested interactively. The Anthropic key line is commented out; uncomment it if your setup needs one.


In [4]:
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")
# os.environ["ANTHROPIC_API_KEY"] = os.getenv("ANTHROPIC_API_KEY") or getpass.getpass("Enter your Anthropic API key: ")
os.environ["PHOENIX_API_KEY"] = os.getenv("PHOENIX_API_KEY") or getpass.getpass("Enter your Phoenix API key: ")
HOSTNAME = os.getenv("PHOENIX_HOSTNAME") or getpass.getpass("Enter your Phoenix hostname: ")

## Configuration

- LOOPS: number of Prompt Learning loops, i.e. how many times the ruleset is optimized. We start from an empty ruleset, so iteration #1 generates rules from scratch and each later loop refines them.
- TRAIN_SIZE: number of instances in the training set.
- TEST_SIZE: number of instances in the test set.
- WORKERS: number of parallel workers used to run Cline on SWE-bench. Set this according to your machine's capabilities and your LLM rate limits.
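To get a feel for cost before running anything, the settings can be turned into a rough count of Cline runs. This is a minimal sketch, assuming every loop runs the full train and test sets once; `planned_cline_runs` and `useful_workers` are hypothetical helpers, not part of the notebook's modules, and the one-instance-per-worker model is an assumption about how the parallel runner behaves.

```python
def planned_cline_runs(loops: int, train_size: int, test_size: int) -> int:
    """Total Cline runs if every loop runs the full train and test sets once."""
    if min(loops, train_size, test_size) < 0:
        raise ValueError("loop count and set sizes must be non-negative")
    return loops * (train_size + test_size)


def useful_workers(requested: int, batch_size: int) -> int:
    """Assumption: one worker handles one instance at a time, so extras sit idle."""
    return max(1, min(requested, batch_size))


print(planned_cline_runs(loops=2, train_size=5, test_size=5))  # 20
print(useful_workers(requested=10, batch_size=5))              # 5
```

Each run involves many LLM calls plus a test-harness execution, so the run count is a lower bound on effort rather than a cost estimate.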


In [5]:
LOOPS = 2
TRAIN_SIZE = 5
TEST_SIZE = 5
WORKERS = 10

## Cline Environment Configuration

Set environment variables for Cline to run properly in Act Mode.


In [6]:
os.environ["CLINE_DISABLE_TERMINAL_REUSE"] = "1"
os.environ["CLINE_DEFAULT_TERMINAL_PROFILE"] = "bash"
os.environ["CLINE_SHELL_INTEGRATION_TIMEOUT_SEC"] = "10"
os.environ["CLINE_STANDALONE_CAPTURE_STDIO"] = "1"
os.environ["CLINE_SKIP_RESUME_CONFIRMATION"] = "1"
os.environ["CLINE_AUTO_FOLLOWUP"] = "1"
os.environ["CLINE_ENVIRONMENT"] = "local"

## Train/Test Datasets

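This code shuffles the SWE-bench Lite instances with a fixed seed, then takes the first TRAIN_SIZE instances for training and the last TEST_SIZE instances for testing. Keep TRAIN_SIZE + TEST_SIZE at or below the dataset size so the two sets do not overlap.

The train set is used to optimize the ruleset, while the test set measures how well each optimized ruleset generalizes.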
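The split logic can be isolated into a small function that also guards against overlapping sets. This is a sketch only: `split_ids` is a hypothetical helper, the instance IDs are made up for illustration, and it uses a private `random.Random` instance instead of seeding the global generator.

```python
import random


def split_ids(ids, train_size, test_size, seed=40):
    """Shuffle a copy of ids, then take train from the front and test from the back."""
    if train_size + test_size > len(ids):
        raise ValueError("train_size + test_size exceeds the number of instances")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:train_size]
    test = shuffled[len(shuffled) - test_size:]
    return train, test


# Hypothetical instance IDs, for illustration only.
example_ids = [f"repo__project-{n}" for n in range(20)]
train, test = split_ids(example_ids, train_size=5, test_size=5)
assert not set(train) & set(test)  # the two sets never share an instance
```

Fixing the seed keeps the same instances in each set across notebook runs, which makes accuracy numbers from different sessions comparable.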


In [7]:
dataset_name = "SWE-bench/SWE-bench_Lite"
dataset = load_swebench_dataset(dataset_name, "test")
ids = [inst["instance_id"] for inst in dataset]
random.seed(40)
random.shuffle(ids)
train_ids = ids[:TRAIN_SIZE]        
test_ids = ids[len(ids) - TEST_SIZE:]
by_id = {ex["instance_id"]: ex for ex in dataset}
train_dataset = [by_id[i] for i in train_ids]
test_dataset  = [by_id[i] for i in test_ids]

train_pd = pd.DataFrame(train_dataset)
test_pd  = pd.DataFrame(test_dataset)


## Upload Datasets to Phoenix

Upload datasets to Phoenix for experiment tracking and visualization.


In [8]:
from phoenix.client import Client

phoenix_client = Client(base_url=HOSTNAME, api_key=os.getenv("PHOENIX_API_KEY"))

train_dataset = phoenix_client.datasets.create_dataset(
    name="Cline Act Mode: SWE-bench Train",
    dataset_description="Cline Act Mode: SWE-bench Train",
    dataframe=train_pd,
    input_keys=['problem_statement'],
    metadata_keys=['instance_id', 'test_patch'],
    output_keys=[]
)

test_dataset = phoenix_client.datasets.create_dataset(
    name="Cline Act Mode: SWE-bench Test",
    dataset_description="Cline Act Mode: SWE-bench Test",
    dataframe=test_pd,
    input_keys=['problem_statement'],
    metadata_keys=['instance_id', 'test_patch'],
    output_keys=[]
)

2025-11-06 11:48:42,097 - phoenix.client.resources.datasets - INFO - Uploading dataset...
2025-11-06 11:48:42,132 - phoenix.client.resources.datasets - INFO - Dataset uploaded successfully. ID: RGF0YXNldDoxOA==, Version: RGF0YXNldFZlcnNpb246MTg=
2025-11-06 11:48:42,133 - phoenix.client.resources.datasets - INFO - Uploading dataset...
2025-11-06 11:48:42,254 - phoenix.client.resources.datasets - INFO - Dataset uploaded successfully. ID: RGF0YXNldDoxOQ==, Version: RGF0YXNldFZlcnNpb246MTk=


## Helper: Log Experiments to Phoenix

This helper function logs experiment results to Phoenix, allowing us to visualize and track optimization progress across iterations.


In [9]:
from phoenix_experiments import log_experiment_to_phoenix

## Ruleset Optimization Loop

This is the main optimization loop. For each iteration:

1. **Run Cline in Act Mode on training set** with the current ruleset, generating actual code patches
2. **Run Cline in Act Mode on test set** with the current ruleset to measure generalization
3. **Run SWE-bench tests** to validate patches and compute pass/fail metrics
4. **Evaluate results** using LLM-as-judge to provide detailed feedback on patch quality
5. **Optimize the ruleset** using Prompt Learning based on training results and feedback
6. **Save results and rulesets** for tracking and analysis

The optimization loop uses actual test execution results (pass/fail) as ground truth, combined with LLM evaluator feedback to iteratively improve the ruleset.
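The data flow of the steps above can be sketched with plain Python and stand-in callables. This is an illustrative skeleton, not the notebook's implementation: `optimization_loop`, `accuracy`, and the stub functions are hypothetical, and results are modeled as lists of dicts rather than DataFrames. Note that the accuracies recorded for a loop describe the ruleset that went into that loop, before it is optimized.

```python
from typing import Callable, Dict, List

Result = Dict[str, str]  # one row per instance, with a "pass_or_fail" field


def accuracy(results: List[Result]) -> float:
    """Fraction of results whose pass_or_fail field equals "pass"."""
    if not results:
        return 0.0
    return sum(r["pass_or_fail"] == "pass" for r in results) / len(results)


def optimization_loop(
    loops: int,
    run_agent: Callable[[str, str], List[Result]],  # (split, ruleset) -> results
    judge: Callable[[List[Result]], List[Result]],  # adds feedback fields
    optimize: Callable[[List[Result], str], str],   # (feedback, ruleset) -> new ruleset
    ruleset: str = "No rules",
) -> List[dict]:
    history = []
    for loop in range(loops):
        train_results = run_agent("train", ruleset)
        test_results = run_agent("test", ruleset)
        history.append({
            "loop": loop,
            "ruleset": ruleset,
            "train_accuracy": accuracy(train_results),
            "test_accuracy": accuracy(test_results),
        })
        # Only training feedback reaches the optimizer; test results stay held out.
        ruleset = optimize(judge(train_results), ruleset)
    return history


def _stub_run(split: str, ruleset: str) -> List[Result]:
    # Stand-in for a real agent run: one passing and one failing instance.
    return [
        {"instance_id": "a", "pass_or_fail": "pass"},
        {"instance_id": "b", "pass_or_fail": "fail"},
    ]


history = optimization_loop(
    loops=2,
    run_agent=_stub_run,
    judge=lambda rows: rows,
    optimize=lambda rows, rules: rules + " (revised)",
)
print(history[1]["ruleset"])  # No rules (revised)
```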


In [None]:
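ruleset = "No rules"

# Make sure the output directories exist before anything is written to them.
os.makedirs("act_results", exist_ok=True)
os.makedirs("act_rulesets", exist_ok=True)

for loop in range(LOOPS):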
    print(f"Running for loop: {loop}")

    train_run_id = f"train_{loop}"
    test_run_id = f"test_{loop}"

    train_df = run_act(dataset_name=dataset_name, instance_ids=train_ids, run_id=train_run_id, ruleset=ruleset, workers=WORKERS)
    test_df = run_act(dataset_name=dataset_name, instance_ids=test_ids, run_id=test_run_id, ruleset=ruleset, workers=WORKERS)

    test_df.to_csv(f"act_results/test_results_{loop}.csv", index=False)
    
    train_acc = sum(train_df["pass_or_fail"] == "pass") / len(train_df)
    test_acc = sum(test_df["pass_or_fail"] == "pass") / len(test_df)
    print(f"Train Accuracy: {train_acc}")
    print(f"Test Accuracy: {test_acc}")

    # Reinstall Phoenix into the current interpreter's environment in case the
    # SWE-bench run changed any of its dependencies.
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-qq", "--upgrade", "arize-phoenix", "wrapt"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )

    evaluated_train_results = evaluate_results(train_df)
    evaluated_train_results.to_csv(f"act_results/train_results_{loop}.csv", index=False)
    
    # Log experiment to Phoenix using REST API
    log_experiment_to_phoenix(
        hostname=HOSTNAME,
        api_key=os.getenv("PHOENIX_API_KEY"),
        dataset_obj=train_dataset,
        experiment_name=f"Train {loop}",
        experiment_df=evaluated_train_results,
        metadata={
            "loop": loop,
            "train_accuracy": train_acc,
            "test_accuracy": test_acc,
            "train_size": TRAIN_SIZE,
            "test_size": TEST_SIZE
        }
    )

    pl_optimizer = PromptLearningOptimizer(
        prompt=CLINE_PROMPT,
        model_choice="gpt-5",
        openai_api_key=os.getenv("OPENAI_API_KEY")
    )
    ruleset = pl_optimizer.optimize(
        dataset=evaluated_train_results,
        output_column="cline_patch",
        feedback_columns=["correctness", "explanation"],
        ruleset=ruleset,
        context_size_k=400000
    )
    with open(f"act_rulesets/ruleset_{loop}.txt", "w") as f:
        f.write(f"train_accuracy: {train_acc} \n")
        f.write(f"test_accuracy: {test_acc} \n")
        f.write(f"optimized ruleset_{loop}: \n {ruleset} \n")



Running for loop: 0
[DEBUG] ensure_git_baseline: ws=/Users/priyanjindal/materialized_repos/sympy__sympy-20442
[DEBUG] ensure_git_baseline: ws=/Users/priyanjindal/materialized_repos/sympy__sympy-15308
[DEBUG] ensure_git_baseline: ws=/Users/priyanjindal/materialized_repos/sympy__sympy-13043
[DEBUG] ensure_git_baseline: ws=/Users/priyanjindal/materialized_repos/sympy__sympy-22005
[DEBUG] ensure_git_baseline: ws=/Users/priyanjindal/materialized_repos/scikit-learn__scikit-learn-14087


[INFO] Starting standalone server; log: /var/folders/_w/glgvmwgs3s5g81607b0x435c0000gn/T/cline-python-server-27031.log
[INFO] Starting standalone server; log: /var/folders/_w/glgvmwgs3s5g81607b0x435c0000gn/T/cline-python-server-27021.log
[INFO] Starting standalone server; log: /var/folders/_w/glgvmwgs3s5g81607b0x435c0000gn/T/cline-python-server-27001.log
[INFO] Starting standalone server; log: /var/folders/_w/glgvmwgs3s5g81607b0x435c0000gn/T/cline-python-server-27041.log
[INFO] Starting standalone server; log: /var/folders/_w/glgvmwgs3s5g81607b0x435c0000gn/T/cline-python-server-27011.log


[DEBUG] export_patch: ws=/Users/priyanjindal/materialized_repos/sympy__sympy-20442
[DEBUG] staging changes for diff (excluding sqlite/db artifacts)
[DEBUG] export_patch: ws=/Users/priyanjindal/materialized_repos/sympy__sympy-15308
[DEBUG] staging changes for diff (excluding sqlite/db artifacts)
[DEBUG] export_patch: ws=/Users/priyanjindal/materialized_repos/scikit-learn__scikit-learn-14087
[DEBUG] staging changes for diff (excluding sqlite/db artifacts)
[DEBUG] staged files:
sympy/physics/units/tests/test_util.py
sympy/physics/units/util.py

[DEBUG] unstaged files:

[DEBUG] diff bytes=3504
[DEBUG] wrote predictions: /var/folders/_w/glgvmwgs3s5g81607b0x435c0000gn/T/preds_ocnqkkeb.jsonl
[DEBUG] staged files:
sklearn/linear_model/logistic.py
sklearn/linear_model/tests/test_logistic.py

[DEBUG] unstaged files:

[DEBUG] diff bytes=3582
[DEBUG] wrote predictions: /var/folders/_w/glgvmwgs3s5g81607b0x435c0000gn/T/preds_y87s2ffo.jsonl
[DEBUG] staged files:
.task_progress.md
sympy/printing/latex

5 instances already run, skipping...
No instances to run.
Cleaning cached images...
Removed 0 images.
Total instances: 5
Instances submitted: 5
Instances completed: 5
Instances incomplete: 0
Instances resolved: 2
Instances unresolved: 3
Instances with empty patches: 0
Instances with errors: 0
Unstopped containers: 1
Unremoved images: 5
Report written to cline.train_0.json
[DEBUG] ensure_git_baseline: ws=/Users/priyanjindal/materialized_repos/sympy__sympy-13177
[DEBUG] ensure_git_baseline: ws=/Users/priyanjindal/materialized_repos/sympy__sympy-24102
[DEBUG] ensure_git_baseline: ws=/Users/priyanjindal/materialized_repos/sympy__sympy-18532
[DEBUG] ensure_git_baseline: ws=/Users/priyanjindal/materialized_repos/matplotlib__matplotlib-23476
[DEBUG] ensure_git_baseline: ws=/Users/priyanjindal/materialized_repos/django__django-11422


[INFO] Starting standalone server; log: /var/folders/_w/glgvmwgs3s5g81607b0x435c0000gn/T/cline-python-server-27041.log
[INFO] Starting standalone server; log: /var/folders/_w/glgvmwgs3s5g81607b0x435c0000gn/T/cline-python-server-27021.log
[INFO] Starting standalone server; log: /var/folders/_w/glgvmwgs3s5g81607b0x435c0000gn/T/cline-python-server-27031.log
[INFO] Starting standalone server; log: /var/folders/_w/glgvmwgs3s5g81607b0x435c0000gn/T/cline-python-server-27001.log
[INFO] Starting standalone server; log: /var/folders/_w/glgvmwgs3s5g81607b0x435c0000gn/T/cline-python-server-27011.log


[DEBUG] export_patch: ws=/Users/priyanjindal/materialized_repos/sympy__sympy-24102
[DEBUG] staging changes for diff (excluding sqlite/db artifacts)
[DEBUG] export_patch: ws=/Users/priyanjindal/materialized_repos/sympy__sympy-13177
[DEBUG] staging changes for diff (excluding sqlite/db artifacts)
[DEBUG] export_patch: ws=/Users/priyanjindal/materialized_repos/sympy__sympy-18532
[DEBUG] staging changes for diff (excluding sqlite/db artifacts)
[DEBUG] staged files:
sympy/assumptions/sathandlers.py
sympy/core/basic.py
sympy/core/mod.py
sympy/core/tests/test_arit.py
sympy/plotting/plot.py

[DEBUG] unstaged files:

[DEBUG] diff bytes=4050
[DEBUG] wrote predictions: /var/folders/_w/glgvmwgs3s5g81607b0x435c0000gn/T/preds_39fare3_.jsonl
[DEBUG] staged files:
sympy/core/basic.py
sympy/core/tests/test_basic.py
sympy/core/tests/test_expr.py

[DEBUG] staged files:
.parse_mathematica_unicode_todo.md
sympy/parsing/mathematica.py
sympy/parsing/tests/test_mathematica.py

[DEBUG] unstaged files:

[DEBUG]

5 instances already run, skipping...
No instances to run.
Cleaning cached images...
Removed 0 images.
Total instances: 5
Instances submitted: 5
Instances completed: 5
Instances incomplete: 0
Instances resolved: 2
Instances unresolved: 3
Instances with empty patches: 0
Instances with errors: 0
Unstopped containers: 0
Unremoved images: 5
Report written to cline.test_0.json
Train Accuracy: 0.4
Test Accuracy: 0.4
using updated 2.0 script
✓ Created experiment 'Train 0' (ID: RXhwZXJpbWVudDoxMDA=)
✓ Mapped 5 examples from dataset
example_id_map
{'sympy__sympy-20442': 'RGF0YXNldEV4YW1wbGU6MTk0Ng==', 'scikit-learn__scikit-learn-14087': 'RGF0YXNldEV4YW1wbGU6MTk0Nw==', 'sympy__sympy-13043': 'RGF0YXNldEV4YW1wbGU6MTk0OA==', 'sympy__sympy-22005': 'RGF0YXNldEV4YW1wbGU6MTk0OQ==', 'sympy__sympy-15308': 'RGF0YXNldEV4YW1wbGU6MTk1MA=='}
✓ Created 5 experiment runs (0 failed)
✓ Created 5 evaluations (0 failed)
['instance_id', 'problem_statement', 'patch', 'test_patch', 'cline_patch', 'pass_or_fail'

## Results

Navigate to Datasets and Experiments in Phoenix to view your Cline runs and track improvements across iterations. Your results will look something like this:

![Phoenix experiment results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/Screenshot%202025-10-13%20at%2011.28.40%E2%80%AFAM.png)

As you can see, the run in this screenshot shows a 15% increase in Cline's accuracy on SWE-bench!


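## Final Ruleset

The ruleset produced at each optimization iteration is saved in `act_rulesets` as `ruleset_{loop}.txt`, together with the train and test accuracy measured in that loop before the optimization step.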
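To compare iterations side by side, the saved files can be read back into Python. The sketch below assumes the three-part layout the loop writes (a `train_accuracy` line, a `test_accuracy` line, an `optimized ruleset_N:` header, then the ruleset text); `parse_ruleset_file` and `load_rulesets` are hypothetical helpers, and the sample ruleset text is made up for illustration.

```python
from pathlib import Path


def parse_ruleset_file(text: str) -> dict:
    """Parse two accuracy lines, a header line, then the ruleset body."""
    lines = text.splitlines()
    train = float(lines[0].split(":", 1)[1])
    test = float(lines[1].split(":", 1)[1])
    # lines[2] is the "optimized ruleset_N:" header; the rest is the ruleset.
    ruleset = "\n".join(lines[3:]).strip()
    return {"train_accuracy": train, "test_accuracy": test, "ruleset": ruleset}


def load_rulesets(directory: str = "act_rulesets") -> list:
    """Read every ruleset_*.txt in the directory, ordered by loop number."""
    paths = sorted(
        Path(directory).glob("ruleset_*.txt"),
        key=lambda p: int(p.stem.split("_")[1]),
    )
    return [parse_ruleset_file(p.read_text()) for p in paths]


# Made-up file contents in the same layout, for illustration only.
sample = (
    "train_accuracy: 0.4 \n"
    "test_accuracy: 0.4 \n"
    "optimized ruleset_0: \n"
    " Always run the tests. \n"
)
print(parse_ruleset_file(sample))
```

Because the accuracies in each file were measured before that loop's optimization step, the effect of `ruleset_0.txt` shows up in the accuracy lines of `ruleset_1.txt`.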