# Cline Prompt Learning Optimization on SWE-bench - Act Mode

<p align="left">
  <span style="display: inline-block; width: 500px; height: auto; overflow: hidden; vertical-align: middle;">
    <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/phoenix.jpeg" style="margin: -10px -50px; width: 600px;" />
  </span>
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/cline.png" width="200" style="vertical-align: middle;" />
</p>

This notebook demonstrates how we used Prompt Learning to optimize Cline's performance on the SWE-bench dataset in **Act Mode**. Cline is a popular open-source coding agent. We aim to improve its performance on SWE-bench by optimizing its **rules**, the user-specified instructions that Cline appends to its system prompt.

[More on Cline](https://cline.bot)

[More on Prompt Learning](https://arize.com/blog/prompt-learning-using-english-feedback-to-optimize-llm-systems/)

## Act Mode - Real Code Execution

Unlike Plan Mode, this notebook runs Cline in **Act Mode**, where Cline actually edits the codebase and generates patches. We then run the SWE-bench tests against each patch to check whether Cline made the correct edits. The resulting pass/fail outcomes give a ground-truth evaluation of Cline's performance.

In Act Mode, Cline:
1. Analyzes the problem statement
2. Explores the codebase
3. Makes actual code edits
4. Generates patches
5. Has its patches validated against the SWE-bench test suite

## SWE-bench + Cline Setup

**Please visit README.md and complete all the Setup before running this notebook!**

## Phoenix Setup

Visit phoenix.arize.com and sign in or create an account.

## Important Note

Running this notebook is computationally intensive and expensive as it involves:
- Multiple API calls to Claude for each SWE-bench instance
- Actually cloning repositories and running tests in isolated environments
- Running SWE-bench harness to validate patches

Consider adjusting the training and test set sizes based on your requirements, budget constraints, and computational resources.


In [1]:
%pip install -qq swebench arize-phoenix arize-phoenix-client pandas

Note: you may need to restart the kernel to use updated packages.


In [2]:
from run_act import run_act
from swebench.harness.utils import load_swebench_dataset
import random
from evals_act import evaluate_results
import os
import sys
import subprocess
from pathlib import Path
import pandas as pd
import getpass

notebook_dir = Path().absolute()
sys.path.insert(0, str(notebook_dir.parent.parent))
from optimizer_sdk.prompt_learning_optimizer import PromptLearningOptimizer

sys.path.insert(0, str(notebook_dir.parent))
from constants import CLINE_PROMPT



## API Keys

Set up your API keys for OpenAI, Anthropic, and Phoenix, along with your Phoenix hostname. You'll be prompted for any value that is not already set in your environment.


In [3]:
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")
os.environ["ANTHROPIC_API_KEY"] = os.getenv("ANTHROPIC_API_KEY") or getpass.getpass("Enter your Anthropic API key: ")
os.environ["PHOENIX_API_KEY"] = os.getenv("PHOENIX_API_KEY") or getpass.getpass("Enter your Phoenix API key: ")
HOSTNAME = os.getenv("PHOENIX_HOSTNAME") or getpass.getpass("Enter your Phoenix hostname: ")

## Configuration

- LOOPS: number of Prompt Learning loops, i.e. how many times the ruleset is optimized.
- TRAIN_SIZE: size of training set.
- TEST_SIZE: size of test set.
- WORKERS: SWE-bench is set up to run in parallel, with however many workers you specify. Set this relative to your machine's capabilities and your Claude rate limits.


In [46]:
LOOPS = 5
TRAIN_SIZE = 150
TEST_SIZE = 150
WORKERS = 52

## Cline Environment Configuration

Set environment variables for Cline to run properly in Act Mode.


In [5]:
os.environ["CLINE_DISABLE_TERMINAL_REUSE"] = "1"
os.environ["CLINE_DEFAULT_TERMINAL_PROFILE"] = "bash"
os.environ["CLINE_SHELL_INTEGRATION_TIMEOUT_SEC"] = "10"
os.environ["CLINE_STANDALONE_CAPTURE_STDIO"] = "1"
os.environ["CLINE_SKIP_RESUME_CONFIRMATION"] = "1"
os.environ["CLINE_AUTO_FOLLOWUP"] = "1"
os.environ["CLINE_ENVIRONMENT"] = "local"

## Train/Test Datasets

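This code shuffles SWE-bench Lite with a fixed seed, then takes the first TRAIN_SIZE instances as the train set and the last TEST_SIZE instances as the test set. The two sets are disjoint only if TRAIN_SIZE + TEST_SIZE does not exceed the number of instances in the dataset.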

The train set will be used to optimize the ruleset, while the test set will be used to measure the success of optimized rulesets.


In [67]:
dataset_name = "SWE-bench/SWE-bench_Lite"
dataset = load_swebench_dataset(dataset_name, "test")
ids = [inst["instance_id"] for inst in dataset]
random.seed(40)
random.shuffle(ids)
train_ids = ids[:TRAIN_SIZE]        
test_ids = ids[len(ids) - TEST_SIZE:]
by_id = {ex["instance_id"]: ex for ex in dataset}
train_dataset = [by_id[i] for i in train_ids]
test_dataset  = [by_id[i] for i in test_ids]

train_pd = pd.DataFrame(train_dataset)
test_pd  = pd.DataFrame(test_dataset)
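
Because the train set is taken from the front of the shuffled list and the test set from the back, it is worth confirming that they do not overlap before spending compute on them. The following is a minimal sketch of that check, not part of the notebook's own helpers: `split_ids` is a hypothetical function, and the IDs and the count of 300 are made-up stand-ins for the real SWE-bench instance IDs.

```python
import random


def split_ids(ids, train_size, test_size, seed=40):
    """Shuffle ids deterministically and return disjoint (train, test) lists."""
    if train_size + test_size > len(ids):
        raise ValueError(
            f"train_size + test_size = {train_size + test_size} "
            f"exceeds the {len(ids)} available instances"
        )
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:train_size]                    # front of the shuffled list
    test = shuffled[len(shuffled) - test_size:]      # back of the shuffled list
    return train, test


# Hypothetical stand-in IDs; the notebook uses SWE-bench instance IDs instead
demo_ids = [f"repo__project-{i}" for i in range(300)]
train_demo, test_demo = split_ids(demo_ids, 150, 150)
overlap = set(train_demo) & set(test_demo)
print(len(train_demo), len(test_demo), len(overlap))  # 150 150 0
```

Raising an error when the sizes do not fit is one design choice; silently overlapping splits would leak training instances into the test accuracy.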


## Upload Datasets to Phoenix

Upload the datasets to Phoenix for experiment tracking and visualization.


In [68]:
from phoenix.client import Client

phoenix_client = Client(base_url=HOSTNAME, api_key=os.getenv("PHOENIX_API_KEY"))

train_dataset = phoenix_client.datasets.create_dataset(
    name="Cline Act Mode: SWE-bench Train",
    dataset_description="Cline Act Mode: SWE-bench Train",
    dataframe=train_pd,
    input_keys=['problem_statement'],
    metadata_keys=['instance_id', 'test_patch'],
    output_keys=[]
)

test_dataset = phoenix_client.datasets.create_dataset(
    name="Cline Act Mode: SWE-bench Test",
    dataset_description="Cline Act Mode: SWE-bench Test",
    dataframe=test_pd,
    input_keys=['problem_statement'],
    metadata_keys=['instance_id', 'test_patch'],
    output_keys=[]
)

2025-10-10 17:48:05,715 - phoenix.client.resources.datasets - INFO - Uploading dataset...
2025-10-10 17:48:06,346 - httpx - INFO - HTTP Request: POST https://app.phoenix.arize.com/s/cline-priyan/v1/datasets/upload?sync=true "HTTP/1.1 200 OK"
2025-10-10 17:48:06,417 - httpx - INFO - HTTP Request: GET https://app.phoenix.arize.com/s/cline-priyan/v1/datasets/RGF0YXNldDoxNA%3D%3D "HTTP/1.1 200 OK"
2025-10-10 17:48:06,559 - httpx - INFO - HTTP Request: GET https://app.phoenix.arize.com/s/cline-priyan/v1/datasets/RGF0YXNldDoxNA%3D%3D/examples?version_id=RGF0YXNldFZlcnNpb246MTQ%3D "HTTP/1.1 200 OK"
2025-10-10 17:48:06,612 - phoenix.client.resources.datasets - INFO - Dataset uploaded successfully. ID: RGF0YXNldDoxNA==, Version: RGF0YXNldFZlcnNpb246MTQ=
2025-10-10 17:48:06,636 - phoenix.client.resources.datasets - INFO - Uploading dataset...
2025-10-10 17:48:06,748 - httpx - INFO - HTTP Request: POST https://app.phoenix.arize.com/s/cline-priyan/v1/datasets/upload?sync=true "HTTP/1.1 409 Conflic

DatasetUploadError: Dataset upload failed: Dataset with the same name already exists: name='Cline Act Mode: SWE-bench Test'
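
The error above shows that the upload is rejected when a dataset with the same name already exists, which is what happens when this cell is re-run. One possible workaround, sketched below under stated assumptions, is to give each run a distinct dataset name. `unique_dataset_name` is a hypothetical helper written for this illustration; it is not part of the Phoenix client.

```python
from datetime import datetime, timezone


def unique_dataset_name(base_name, now=None):
    """Append a UTC timestamp so repeated uploads do not collide by name."""
    now = now or datetime.now(timezone.utc)
    return f"{base_name} ({now.strftime('%Y-%m-%d %H:%M:%S')} UTC)"


# A fixed timestamp is passed here only to make the example reproducible
example_time = datetime(2025, 10, 10, 17, 48, 6, tzinfo=timezone.utc)
print(unique_dataset_name("Cline Act Mode: SWE-bench Test", example_time))
# Cline Act Mode: SWE-bench Test (2025-10-10 17:48:06 UTC)
```

The returned string could then be passed as the `name` argument of `create_dataset`. The trade-off is that every run leaves a new dataset behind in Phoenix, so old ones may need occasional cleanup.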

## Helper: Log Experiments to Phoenix

This helper function logs experiment results to Phoenix, allowing us to visualize and track optimization progress across iterations.


In [62]:
import importlib
import phoenix_experiments
importlib.reload(phoenix_experiments)
from phoenix_experiments import log_experiment_to_phoenix

## Ruleset Optimization Loop

This is the main optimization loop. For each iteration:

1. **Run Cline in Act Mode on training set** with the current ruleset, generating actual code patches
2. **Run Cline in Act Mode on test set** with the current ruleset to measure generalization
3. **Run SWE-bench tests** to validate patches and compute pass/fail metrics
4. **Evaluate results** using LLM-as-judge to provide detailed feedback on patch quality
5. **Optimize the ruleset** using Prompt Learning based on training results and feedback
6. **Save results and rulesets** for tracking and analysis

The optimization loop uses actual test execution results (pass/fail) as ground truth, combined with LLM evaluator feedback to iteratively improve the ruleset.
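
The control flow of these steps can be sketched without any of the real dependencies. In the sketch below, `fake_run`, `fake_judge`, and `fake_optimize` are hypothetical stubs standing in for `run_act`, `evaluate_results`, and `PromptLearningOptimizer.optimize`; they only illustrate the order of operations and are not the notebook's implementation.

```python
def pass_rate(outcomes):
    """Fraction of outcomes equal to "pass"; 0.0 for an empty run."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(o == "pass" for o in outcomes) / len(outcomes)


def optimization_loop(run_agent, judge, optimize, train_ids, test_ids, loops):
    """Run, score, judge, then rewrite the ruleset, once per loop."""
    ruleset = ""
    history = []
    for loop in range(loops):
        train_outcomes = run_agent(train_ids, ruleset)  # steps 1 and 3
        test_outcomes = run_agent(test_ids, ruleset)    # steps 2 and 3
        history.append({
            "loop": loop,
            "ruleset": ruleset,  # the ruleset these accuracies were measured with
            "train_accuracy": pass_rate(train_outcomes),
            "test_accuracy": pass_rate(test_outcomes),
        })
        feedback = judge(train_outcomes)                # step 4
        ruleset = optimize(ruleset, feedback)           # step 5
    return ruleset, history


# Hypothetical stubs: each added rule "solves" one more instance
def fake_run(ids, ruleset):
    solved = min(len(ids), 1 + ruleset.count("rule"))
    return ["pass"] * solved + ["fail"] * (len(ids) - solved)


def fake_judge(outcomes):
    return f"{outcomes.count('fail')} failures"


def fake_optimize(ruleset, feedback):
    return ruleset + "rule\n"


final_ruleset, history = optimization_loop(
    fake_run, fake_judge, fake_optimize, list(range(4)), list(range(4)), loops=3
)
print([h["train_accuracy"] for h in history])  # [0.25, 0.5, 0.75]
```

Note the ordering: the accuracies recorded in a given loop are measured with the ruleset that entered that loop, not the one produced at its end. The effect of the ruleset written in loop N therefore shows up in the accuracies of loop N + 1.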


In [None]:
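# Start from an empty ruleset and make sure the output directories exist
ruleset = ""
os.makedirs("act_results", exist_ok=True)
os.makedirs("act_rulesets", exist_ok=True)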

for loop in range(LOOPS):
    print(f"Running for loop: {loop}")

    train_run_id = f"claude_150_train_{loop}"
    test_run_id = f"150_act_test_{loop}"

    train_df = run_act(dataset_name=dataset_name, instance_ids=train_ids, run_id=train_run_id, ruleset=ruleset, workers=WORKERS)
    test_df = run_act(dataset_name=dataset_name, instance_ids=test_ids, run_id=test_run_id, ruleset=ruleset, workers=WORKERS)

    test_df.to_csv(f"act_results/test_results_{loop}.csv", index=False)
    
    train_acc = sum(train_df["pass_or_fail"] == "pass") / len(train_df)
    test_acc = sum(test_df["pass_or_fail"] == "pass") / len(test_df)
    print(f"Train Accuracy: {train_acc}")
    # print(f"Test Accuracy: {test_acc}")

    # Make sure any swebench package installations did not affect the phoenix package.
    # sys.executable assumes the notebook kernel runs in the environment set up for Cline.
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "--upgrade", "arize-phoenix", "wrapt"]
    )
    evaluated_train_results = evaluate_results(train_df)
    evaluated_train_results.to_csv(f"act_results/testing_train_results_{loop}.csv", index=False)
    evaluated_train_results["score"] = [1.0 if x == "correct" else 0.0 for x in evaluated_train_results["correctness"]]
    
    # Log experiment to Phoenix using REST API
    log_experiment_to_phoenix(
        hostname=HOSTNAME,
        api_key=os.getenv("PHOENIX_API_KEY"),
        dataset_obj=train_dataset,
        experiment_name=f"Train {loop}",
        experiment_df=evaluated_train_results,
        metadata={
            "loop": loop,
            "train_accuracy": train_acc,
            "test_accuracy": test_acc,
            "train_size": TRAIN_SIZE,
            "test_size": TEST_SIZE
        }
    )

    pl_optimizer = PromptLearningOptimizer(
        prompt=CLINE_PROMPT,
        model_choice="gpt-5",
        openai_api_key=os.getenv("OPENAI_API_KEY")
    )
    ruleset = pl_optimizer.optimize(
        dataset=evaluated_train_results,
        output_column="cline_patch",
        feedback_columns=["correctness", "explanation"],
        ruleset=ruleset,
        context_size_k=400000
    )
    with open(f"act_rulesets/ruleset_{loop}.txt", "w") as f:
        f.write(f"train_accuracy: {train_acc} \n")
        f.write(f"test_accuracy: {test_acc} \n")
        f.write(f"optimized ruleset_{loop}: \n {ruleset} \n")



In [None]:
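def create_pd(numcorrect):
    """Build a placeholder results frame with `numcorrect` rows marked correct."""
    n = len(train_pd)
    train_pd_1 = train_pd.copy()
    # Placeholder patch: reuse the test_patch column rather than real Cline output
    train_pd_1["cline_patch"] = train_pd_1["test_patch"]
    train_pd_1["correctness"] = ["correct"] * numcorrect + ["incorrect"] * (n - numcorrect)
    train_pd_1["explanation"] = ["explanation"] * n
    return train_pd_1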

train_pd_1 = create_pd(28)
train_pd_2 = create_pd(30)
train_pd_3 = create_pd(51)
train_pd_4 = create_pd(50)
train_pd_5 = create_pd(51)


In [None]:


# Re-log saved training results. `loop` and `train_acc` carry over from the
# optimization loop cell above, so run that cell first.
experiment_csvs = [f"act_results/train_results_{i}.csv" for i in range(5)]
for experiment_csv in experiment_csvs:
    loop += 1
    experiment_df = pd.read_csv(experiment_csv)
    log_experiment_to_phoenix(
        hostname=HOSTNAME,
        api_key=os.getenv("PHOENIX_API_KEY"),
        dataset_obj=train_dataset,
        experiment_name=f"Train {loop}",
        experiment_df=experiment_df,
        metadata={
            "loop": loop,
            "train_accuracy": train_acc,
            "train_size": TRAIN_SIZE,
        }
    )

ParserError: Error tokenizing data. C error: Expected 8 fields in line 150, saw 9
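
This error means pandas found a row with 9 fields where 8 were expected. A common cause is a free-text column containing a comma that was not quoted when the file was written, though the cause is not confirmed for these files. The sketch below uses Python's standard `csv` module to locate such rows. `find_malformed_rows` is a hypothetical helper, and the sample data is invented for illustration.

```python
import csv
import io


def find_malformed_rows(lines, expected_fields=None):
    """Return (record_number, field_count) for rows whose width differs from the header."""
    bad = []
    for record_number, row in enumerate(csv.reader(lines), start=1):
        if expected_fields is None:
            # Treat the first record as the header that defines the width
            expected_fields = len(row)
            continue
        if len(row) != expected_fields:
            bad.append((record_number, len(row)))
    return bad


# Invented sample: the last row has an unquoted comma in its text field
sample = io.StringIO(
    "instance_id,explanation\n"
    'a-1,"quoted, so this comma is fine"\n'
    "a-2,unquoted, so this comma adds a field\n"
)
print(find_malformed_rows(sample))  # [(3, 3)]
```

For a real file, pass an open handle such as `open(path, newline="")`. Record numbers count parsed rows, so they can differ from the physical line number pandas reports when a quoted field spans several lines.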
