# Problem Statement

Current benchmarks like SWE-bench and SWE-bench Verified evaluate agents on a diverse set of real-world engineering tasks. However, they lack any structured representation of the task space itself. As a result, these headline benchmarks function more as scoreboards than as frameworks for systematically guiding agent development. Without insight into how tasks are distributed or how they span the feature space, it becomes difficult to diagnose where and why agents succeed or fail — limiting the benchmark's utility for capability-driven progress.

# Research Question

Can we map the feature space of benchmark tasks using clustering techniques for mixed-type data, in order to better understand agent successes and failures? Can this structure then be used to both guide the development of specific agent capabilities and more accurately measure agent performance?

# Supported Sections: Results, Unified Spherical Clustering
This notebook supports the Unified Spherical Clustering results reported in the Results section.

In [None]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import silhouette_score
from sklearn.impute import SimpleImputer
from google.colab import files
from sklearn.preprocessing import normalize
from sentence_transformers import SentenceTransformer
from sklearn.metrics import normalized_mutual_info_score


# Analysing the Annotation Dataset

In [None]:
df_annotate = pd.read_csv("ensembled_annotations_public.csv")


In [None]:
df_annotate.head()

# SWE-Bench Verified Dataset

In [None]:
#dataset downloaded from Hugging Face (SWE-bench Verified, test split)

df=pd.read_csv("swe_bench_verified_test.csv")
df_merged = df.merge(df_annotate, on='instance_id', how='left')

In [None]:
df_merged.head()

In [None]:
#dropping duplicate columns and columns that record which annotator made a given decision
decided_by_cols = [col for col in df_merged.columns if col.endswith('_decided_by')]
cols_to_drop = ["difficulty_y"] + decided_by_cols + [
    "filter_out", "difficulty_ensemble_decision_procedure",
    "other_notes", "other_major_issues", "instance_id", "created_at", "version",
    "environment_setup_commit", "base_commit", "test_patch",
]
df_cleaned = df_merged.drop(columns = cols_to_drop)
df_cleaned = df_cleaned.rename(columns = {"difficulty_x":"difficulty"})

In [None]:
df_cleaned.head()

# Feature Extraction From Patches

To prepare the dataset for clustering, I extract structural features from the patch of each task's associated pull request using the PatchSet class from the unidiff Python library.

A patch (also commonly called a diff) is the set of code changes proposed to solve the benchmark task. PatchSet parses the unified diff format and allows a number of low-level structural signals to be extracted from each patch.

The features extracted are:

- The number of files changed

- The number of hunks in a given code change (i.e., blocks of contiguous edits)

- The number of lines of code added

- The number of lines of code removed

These features serve as numerical proxies for task size and task complexity and provide a rough measure of how involved a code change is. Further derived features (total lines changed, change ratio, maximum per-file change, change concentration and change spread) are then computed from them. These will be important later for mapping to agent capabilities such as localisation (where in the code to make the edits) and planning (how many steps are needed).
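
As a rough illustration of how the derived features behave, the sketch below works through the arithmetic for a made-up two-file patch. The per-file counts are hypothetical; the formulas mirror the ones used in the extraction function in the cells that follow.

```python
import math

# Hypothetical per-file line counts for a two-file patch: (added, removed).
per_file = [(12, 3), (2, 1)]

lines_added = sum(added for added, _ in per_file)        # 14
lines_removed = sum(removed for _, removed in per_file)  # 4
total_lines_changed = lines_added + lines_removed        # 18

# Log-damped balance of additions to removals; the +1 in the denominator
# keeps the ratio finite when a patch removes no lines.
change_ratio = math.log1p(lines_added) / (math.log1p(lines_removed) + 1)

# Share of the patch that sits in the single most-edited file
# (the +1 is the same smoothing term used in the notebook's function).
changes_per_file = [added + removed for added, removed in per_file]  # [15, 3]
max_file_change = max(changes_per_file)                              # 15
change_concentration = max_file_change / (total_lines_changed + 1)

print(f"change_ratio={change_ratio:.3f}")
print(f"change_concentration={change_concentration:.3f}")
```

A concentration close to 1 means almost all edits land in one file, while lower values indicate changes spread across several files.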


In [None]:
!pip install unidiff

from unidiff import PatchSet


In [None]:
def extract_patch_features(patch_text: str) -> pd.Series:
    """Function to extract basic numerical features from patch and compute derived features."""
    try:
        patch = PatchSet(patch_text)

        #Basic features
        num_files = len(patch)
        num_hunks = sum(len(file) for file in patch)
        lines_added = sum(hunk.added for file in patch for hunk in file)
        lines_removed = sum(hunk.removed for file in patch for hunk in file)

        #Derived Features
        total_lines_changed = lines_added + lines_removed
        change_ratio = np.log1p(lines_added) / (np.log1p(lines_removed) + 1)

        #Change Concentration
        changes_per_file = [
            sum(hunk.added + hunk.removed for hunk in file)
            for file in patch
        ]

        if changes_per_file:
            max_file_change = max(changes_per_file)
            mean_change = np.mean(changes_per_file)
            change_concentration = max_file_change / (total_lines_changed + 1)
            change_spread = np.std(changes_per_file) / (mean_change + 1)
        else:
            max_file_change = change_concentration = change_spread = 0

        return pd.Series([
            num_files, num_hunks, lines_added, lines_removed,
            total_lines_changed, change_ratio,
            max_file_change, change_concentration, change_spread
        ])

    except Exception:
        return pd.Series([None] * 9)


target_columns = [
    "num_files", "num_hunks", "lines_added", "lines_removed",
    "total_lines_changed", "change_ratio",
    "max_file_change", "change_concentration","change_spread"
]

df_cleaned[target_columns] = df_cleaned["patch"].apply(extract_patch_features)


In [None]:
#Full feature set
comprehensive_numerical_cols = [
    "num_files", "num_hunks", "lines_added", "lines_removed",
    "total_lines_changed", "change_ratio",
    "max_file_change", "change_concentration", "change_spread"
]

#Selected features for SWE-bench verified clustering
selected_numerical_cols = [
    "num_files",               # Multi-file changes
    "num_hunks",               # Complexity indicator
    "total_lines_changed",     # Overall patch size
    "change_ratio",            # Add/removed line balance
    "max_file_change",         # Concentration of changes
    "change_concentration"     # How focused the changes are
]

In [None]:
#Dropping redundant features
cols_to_drop_lowentropy = ["lines_added", "lines_removed", "change_spread"]

df_cleaned = df_cleaned.drop(columns = cols_to_drop_lowentropy)

In [None]:
df_cleaned.head(5)

In [None]:
df_cleaned.columns

# Unified Clustering


#### Creating the numerical, ordinal, categorical and text embeddings for the unified clustering approach

### Numerical
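
The numerical block is scaled with RobustScaler, which with its default settings centres each feature on its median and divides by its interquartile range. The sketch below reproduces that arithmetic with the standard library only, on hypothetical values containing one very large patch, and compares it with mean/standard-deviation scaling. The data are made up for illustration.

```python
import statistics

# Hypothetical total_lines_changed values with one very large patch.
values = [1, 2, 3, 4, 100]

# Robust scaling: subtract the median, divide by the interquartile range.
median = statistics.median(values)
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
robust = [(v - median) / iqr for v in values]

# Standard scaling for comparison: subtract the mean, divide by the std.
mean = statistics.fmean(values)
std = statistics.pstdev(values)
standard = [(v - mean) / std for v in values]

print("robust:  ", [round(x, 2) for x in robust])
print("standard:", [round(x, 2) for x in standard])
```

Under robust scaling the four typical patches stay evenly spaced and only the outlier lands far away. Under standard scaling the outlier inflates the standard deviation and squeezes the typical patches together.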

In [None]:
#scaling the numerical features
numerical_cols = ["num_files", "num_hunks", "total_lines_changed",
                  "change_ratio", "max_file_change", "change_concentration"]
scaler = RobustScaler()
numerical_features = scaler.fit_transform(df_cleaned[numerical_cols])

### Ordinal and Categorical

In [None]:
print("Unique difficulty values:", df_cleaned["difficulty"].unique())

In [None]:
print("Unique underspecified values:", df_cleaned['underspecified'].unique())

In [None]:
difficulty_map = {"<15 min fix": 0, "15 min - 1 hour": 1, "1-4 hours": 2, ">4 hours": 3}
difficulty_features = df_cleaned['difficulty'].map(difficulty_map).values.reshape(-1, 1)
underspecified_features = df_cleaned['underspecified'].values.reshape(-1, 1).astype(float)

### Text embeddings using BERT
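
Before embedding, each problem statement is prefixed with a short domain description of its repository. The sketch below shows that string construction in plain Python, including the fallback to the bare repository name for repos missing from the mapping. The helper name `domain_aware_statement`, the example issue text and the `owner/tool` repo are hypothetical.

```python
# Two entries copied from the notebook's mapping; the rest are omitted here.
repo_context = {
    "django/django": "web framework backend",
    "sympy/sympy": "symbolic mathematics",
}

def domain_aware_statement(repo, problem_statement):
    """Prefix a problem statement with its repo's domain description."""
    # Unknown repos fall back to the bare repo name, e.g. "owner/tool" -> "tool".
    context = repo_context.get(repo, repo.split("/")[-1])
    # Missing statements are treated as empty strings.
    return f"[{context}] {problem_statement or ''}"

print(domain_aware_statement("django/django", "QuerySet.union() drops ordering"))
print(domain_aware_statement("owner/tool", None))
```

The prefix gives the sentence encoder a small amount of domain signal, so the resulting embeddings are expected to carry some repository information by construction.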

In [None]:
def extract_text_features(df, include_repo_context=True):
    """This function extracts BERT embeddings from the problem statements with adding optional repo context"""

    model = SentenceTransformer('all-MiniLM-L6-v2')

    if include_repo_context:
        #repo context mapping
        repo_context = {
            'astropy/astropy': 'astronomy scientific computing',
            'django/django': 'web framework backend',
            'matplotlib/matplotlib': 'data visualization plotting',
            'mwaskom/seaborn': 'statistical visualization',
            'pallets/flask': 'web microframework',
            'psf/requests': 'HTTP library networking',
            'pydata/xarray': 'multidimensional arrays',
            'pylint-dev/pylint': 'code analysis linting',
            'pytest-dev/pytest': 'testing framework',
            'scikit-learn/scikit-learn': 'machine learning',
            'sphinx-doc/sphinx': 'documentation generator',
            'sympy/sympy': 'symbolic mathematics'
        }

        #creating domain-aware problem statements
        df['repo_context'] = df['repo'].map(repo_context).fillna(df['repo'].str.split('/').str[-1])
        problem_statements = ('[' + df['repo_context'] + '] ' + df['problem_statement'].fillna('')).tolist()
    else:
        #plain problem statements without the repo prefix
        problem_statements = df['problem_statement'].fillna('').tolist()

    text_embeddings = model.encode(problem_statements, show_progress_bar=True)

    return text_embeddings

#extracting with repo context
text_features = extract_text_features(df_cleaned, include_repo_context=True)

### Combining features
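
The combined matrix is L2-normalised before clustering. For unit-length vectors, squared Euclidean distance is a simple function of cosine similarity, ||a - b||^2 = 2 - 2 cos(a, b), so KMeans on normalised rows groups tasks by direction rather than magnitude. This approximates spherical k-means rather than implementing it exactly, because the centroids are not re-projected onto the unit sphere. The sketch below checks the identity on two hypothetical feature vectors.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical combined feature rows (not taken from the dataset).
a = [3.0, 0.0, 1.0, 0.2]
b = [1.0, 2.0, 0.0, 0.5]

ua, ub = l2_normalize(a), l2_normalize(b)
squared_distance = sum((x - y) ** 2 for x, y in zip(ua, ub))

print(f"squared distance on the sphere: {squared_distance:.6f}")
print(f"2 - 2 * cosine similarity:      {2 - 2 * cosine(a, b):.6f}")
```

One consequence worth keeping in mind is that blocks with larger raw magnitude contribute more to each row's direction, so the relative scale of the numerical, ordinal and text blocks affects the clustering.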

In [None]:
#Combining all features
combined_features = np.hstack([
    numerical_features,
    difficulty_features,
    underspecified_features,
    text_features
])

print(f"Numerical shape: {numerical_features.shape}")
print(f"Difficulty shape: {difficulty_features.shape}")
print(f"Underspecified shape: {underspecified_features.shape}")
print(f"Text shape: {text_features.shape}")



In [None]:
#normalising for spherical clustering
spherical_features = normalize(combined_features, norm='l2')
print(f"Normalized features shape: {spherical_features.shape}")

In [None]:
#Elbow method and silhouette analysis
k_range = range(2, 21)  #testing k from 2-20
wcss = []
silhouette_scores = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=50)
    clusters = kmeans.fit_predict(spherical_features)

    wcss.append(kmeans.inertia_)
    sil_score = silhouette_score(spherical_features, clusters)
    silhouette_scores.append(sil_score)

#plots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.plot(k_range, wcss, 'bo-')
ax1.set_xlabel('Number of clusters (k)',fontsize = 12)
ax1.set_ylabel('WCSS', fontsize = 12)
ax1.set_title('Elbow Method')

ax2.plot(k_range, silhouette_scores, 'ro-')
ax2.set_xlabel('Number of clusters (k)',fontsize = 12)
ax2.set_ylabel('Silhouette Score',fontsize = 12)
ax2.set_title('Silhouette Analysis')

plt.tight_layout()
plt.show()

In [None]:
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

k_range = range(2, 21)
metrics_for_thesis = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=50)
    labels = kmeans.fit_predict(spherical_features)

    sil = silhouette_score(spherical_features, labels)
    db = davies_bouldin_score(spherical_features, labels)

    metrics_for_thesis.append({
        'k': k,
        'silhouette': sil,
        'davies_bouldin': db
    })

#identification of best k
best_k_sil = max(metrics_for_thesis, key=lambda x: x['silhouette'])['k']
best_k_db = min(metrics_for_thesis, key=lambda x: x['davies_bouldin'])['k']

print(f"Best k by Silhouette: {best_k_sil}")
print(f"Best k by Davies-Bouldin: {best_k_db}")

In [None]:
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, fowlkes_mallows_score
from scipy.stats import entropy

def evaluate_clustering_for_thesis(features, labels, df, k):
    """Function to calculate the cluster validation metrics"""
    results = {}

    #1. Connectivity
    from sklearn.neighbors import NearestNeighbors
    nn = NearestNeighbors(n_neighbors=5)
    nn.fit(features)
    distances, indices = nn.kneighbors(features)

    connectivity = 0
    for i, neighbors in enumerate(indices):
        same_cluster = sum(labels[n] == labels[i] for n in neighbors[1:])
        connectivity += same_cluster / (len(neighbors) - 1)
    results['connectivity'] = connectivity / len(features)

    #2. Dunn Index
    from sklearn.metrics.pairwise import pairwise_distances
    unique_labels = np.unique(labels)

    inter_distances = []
    for i in range(len(unique_labels)):
        for j in range(i+1, len(unique_labels)):
            mask_i = labels == unique_labels[i]
            mask_j = labels == unique_labels[j]
            dist = pairwise_distances(features[mask_i], features[mask_j]).min()
            inter_distances.append(dist)

    intra_distances = []
    for label in unique_labels:
        mask = labels == label
        if mask.sum() > 1:
            cluster_features = features[mask]
            dist = pairwise_distances(cluster_features).max()
            intra_distances.append(dist)

    results['dunn_index'] = min(inter_distances) / max(intra_distances) if inter_distances and intra_distances else 0

    #3. Repo NMI
    repo_labels = pd.Categorical(df['repo']).codes
    results['repo_nmi'] = normalized_mutual_info_score(labels, repo_labels)

    #4. Specification NMI
    spec_labels = df['underspecified'].astype(int).values
    results['spec_nmi'] = normalized_mutual_info_score(labels, spec_labels)

    # 5. Difficulty NMI
    diff_map = {'<15 min fix': 0, '15 min - 1 hour': 1, '1-4 hours': 2, '>4 hours': 3}
    diff_labels = df['difficulty'].map(diff_map).values
    results['diff_nmi'] = normalized_mutual_info_score(labels, diff_labels)


    #6. Stability
    stability_scores = []
    for seed in range(42, 52):  #10 runs
        kmeans_temp = KMeans(n_clusters=k, random_state=seed, n_init=10)
        labels_temp = kmeans_temp.fit_predict(features)
        stability_scores.append(adjusted_rand_score(labels, labels_temp))

    results['stability'] = np.mean(stability_scores)
    results['stability_std'] = np.std(stability_scores)

    return results


results_list = []

for k in [2, 3, 4, 5, 6, 7, 8, 9, 10]:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=50)
    labels = kmeans.fit_predict(spherical_features)
    metrics = evaluate_clustering_for_thesis(spherical_features, labels, df_cleaned, k)

    results_list.append({
        'K': k,
        'Connectivity': f"{metrics['connectivity']:.1%}",
        'Dunn Index': f"{metrics['dunn_index']:.3f}",
        'Specification NMI': f"{metrics['spec_nmi']:.3f}",
        'Difficulty NMI': f"{metrics['diff_nmi']:.3f}",
        'Repository NMI': f"{metrics['repo_nmi']:.3f}",
        'Stability': f"{metrics['stability']:.3f} ± {metrics['stability_std']:.3f}"
    })

#creating df
metrics_df = pd.DataFrame(results_list)
print("\n=== Alternative Clustering Metrics ===")
metrics_df


In [None]:
#cluster sizes with k=6
spherical_kmeans = KMeans(n_clusters=6, random_state=42, n_init=50)
unified_clusters = spherical_kmeans.fit_predict(spherical_features)

df_cleaned['unified_cluster'] = unified_clusters
print("Cluster distribution:")
print(df_cleaned['unified_cluster'].value_counts().sort_index())

In [None]:
from sklearn.manifold import TSNE

#creating t-SNE visualisation
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
tsne_results = tsne.fit_transform(spherical_features)

#plot
plt.figure(figsize=(10, 8))
scatter = plt.scatter(tsne_results[:, 0], tsne_results[:, 1],
                     c=df_cleaned['unified_cluster'],
                     cmap='tab10',
                     alpha=0.6,
                     s=50)

plt.colorbar(scatter, label='Cluster')
plt.xlabel('t-SNE Component 1', fontsize=12)
plt.ylabel('t-SNE Component 2',fontsize=12)
plt.title('t-SNE Visualisation of Unified Spherical Clusters (K=6)')

#adding approximate cluster centres (mean t-SNE position of each cluster)
for i in range(6):
    cluster_points = tsne_results[df_cleaned['unified_cluster'] == i]
    center_x = cluster_points[:, 0].mean()
    center_y = cluster_points[:, 1].mean()
    plt.annotate(f'C{i}', (center_x, center_y),
                fontsize=12, fontweight='bold',
                bbox=dict(boxstyle="round,pad=0.3", facecolor="white", alpha=0.8))

plt.tight_layout()
plt.show()

In [None]:
#analysing cluster composition by repos
cluster_repo_dist = pd.crosstab(df_cleaned['repo'], df_cleaned['unified_cluster'])
print("Cluster composition by repository:")
print(cluster_repo_dist)

#checking if clusters align with domains
for cluster in range(6):
    mask = df_cleaned['unified_cluster'] == cluster
    repos = df_cleaned[mask]['repo'].value_counts()
    print(f"\nCluster {cluster} top repositories:")
    print(repos.head())

In [None]:
def create_cluster_summary_table(df):
    """function creates summary table using mean values"""
    results = []

    for cluster_num in range(6):
        subset = df[df['unified_cluster'] == cluster_num]

        results.append({
            'Cluster': cluster_num,
            'Size': len(subset),
            'Files': subset['num_files'].mean(),
            'Hunks': subset['num_hunks'].mean(),
            'Total_lines': subset['total_lines_changed'].mean(),
            'Change_ratio': subset['change_ratio'].mean(),
            'Max_file': subset['max_file_change'].mean(),
            'Change_conc': subset['change_concentration'].mean(),
            'Underspec_%': subset['underspecified'].mean() * 100,
        })

    return pd.DataFrame(results)

summary_table = create_cluster_summary_table(df_cleaned)
summary_table

In [None]:
def create_difficulty_distribution_table(df):
    """Function to create a table showing difficulty distribution across clusters"""

    difficulty_order = ['<15 min fix', '15 min - 1 hour', '1-4 hours', '>4 hours']
    results = []

    for cluster_num in range(6):
        subset = df[df['unified_cluster'] == cluster_num]
        difficulty_counts = subset['difficulty'].value_counts()

        row = {'Cluster': cluster_num, 'Size': len(subset)}
        #adding counts and percentages
        for difficulty_level in difficulty_order:
            count = difficulty_counts.get(difficulty_level, 0)
            percentage = (count / len(subset)) * 100 if len(subset) > 0 else 0
            row[difficulty_level] = f"{count} ({percentage:.0f}%)"

        #adding primary difficulty
        row['Primary'] = difficulty_counts.index[0] if len(difficulty_counts) > 0 else "N/A"
        results.append(row)

    diff_df = pd.DataFrame(results)

    #formatted table
    print("\nDifficulty Distribution Across Clusters")
    print("="*120)
    print(f"{'Cluster':<8} {'Size':<6} {'<15 min':<15} {'15min-1h':<15} {'1-4h':<15} {'>4h':<15} {'Primary':<20}")
    print("-"*120)

    for _, row in diff_df.iterrows():
        print(f"{row['Cluster']:<8} {row['Size']:<6} {row['<15 min fix']:<15} {row['15 min - 1 hour']:<15} "
              f"{row['1-4 hours']:<15} {row['>4 hours']:<15} {row['Primary']:<20}")

    return diff_df

difficulty_table = create_difficulty_distribution_table(df_cleaned)

In [None]:
def extract_cluster_examples_for_table(df, cluster_num):
    """This function extracts full problem details for manual selection"""
    subset = df[df['unified_cluster'] == cluster_num]

    print(f"\n{'='*80}")
    print(f"CLUSTER {cluster_num} - Full Examples")
    print(f"{'='*80}")

    #getting representative examples from different difficulty levels
    difficulty_counts = subset['difficulty'].value_counts()

    for difficulty, count in difficulty_counts.items():
        pct = count/len(subset)*100
        if pct < 15:  #skipping very minor categories
            continue

        print(f"\n--- {difficulty} ({count} tasks, {pct:.0f}%) ---")

        difficulty_data = subset[subset['difficulty'] == difficulty]
        samples = difficulty_data.sample(n=min(5, len(difficulty_data)), random_state=42)  #getting up to 5 examples

        for i, (idx, row) in enumerate(samples.iterrows(), 1):
            spec_status = "UNDERSPECIFIED" if row['underspecified'] else "WELL-SPECIFIED"
            print(f"\n{i}. [{spec_status}] {row['repo']}")
            print(f"   Lines changed: {int(row['total_lines_changed'])}")
            print(f"   Problem: {row['problem_statement']}")
            print("-" * 40)

#examples for all clusters
for cluster_num in range(6):
    extract_cluster_examples_for_table(df_cleaned, cluster_num)

## BERT-Only Clusters
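
The comparisons in this section are read through normalised mutual information (NMI) between cluster labels and a reference labelling such as repository or specification level. As a reminder of what the score measures, here is a minimal pure-Python sketch. It normalises by the arithmetic mean of the two entropies, which I understand to be the default in scikit-learn's `normalized_mutual_info_score`. The toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labellings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )

def nmi(a, b):
    """NMI normalised by the arithmetic mean of the two entropies."""
    denom = (entropy(a) + entropy(b)) / 2
    # Two constant labellings carry no entropy; treat them as identical.
    return mutual_information(a, b) / denom if denom > 0 else 1.0

clusters = [0, 0, 1, 1]
print(nmi(clusters, ["django", "django", "sympy", "sympy"]))  # clusters mirror repos
print(nmi(clusters, ["django", "sympy", "django", "sympy"]))  # clusters ignore repos
```

A score near 1 means the clusters largely reproduce the reference labelling, and a score near 0 means they are close to independent of it.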

In [None]:
#Extracting just BERT embeddings with repo context

#repos context mapping (same as before)
repo_context = {
    'astropy/astropy': 'astronomy scientific computing',
    'django/django': 'web framework backend',
    'matplotlib/matplotlib': 'data visualization plotting',
    'mwaskom/seaborn': 'statistical visualization',
    'pallets/flask': 'web microframework',
    'psf/requests': 'HTTP library networking',
    'pydata/xarray': 'multidimensional arrays',
    'pylint-dev/pylint': 'code analysis linting',
    'pytest-dev/pytest': 'testing framework',
    'scikit-learn/scikit-learn': 'machine learning',
    'sphinx-doc/sphinx': 'documentation generator',
    'sympy/sympy': 'symbolic mathematics'
}

#creating domain-aware problem statements on the cleaned dataframe
df_cleaned['repo_context'] = df_cleaned['repo'].map(repo_context).fillna(df_cleaned['repo'].str.split('/').str[-1])
df_cleaned['domain_aware_statement'] = '[' + df_cleaned['repo_context'] + '] ' + df_cleaned['problem_statement'].fillna('')

#embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
bert_only_embeddings = model.encode(df_cleaned['domain_aware_statement'].tolist(), show_progress_bar=True)

#normalising for spherical clustering
bert_spherical = normalize(bert_only_embeddings, norm='l2')

print(f"BERT-only embeddings shape: {bert_spherical.shape}")

#testing different k values
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

k_range = range(2, 21)
silhouette_scores = []
wcss = []

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=50)
    labels = kmeans.fit_predict(bert_spherical)

    sil_score = silhouette_score(bert_spherical, labels)
    silhouette_scores.append(sil_score)
    wcss.append(kmeans.inertia_)

    print(f"k={k}: Silhouette={sil_score:.4f}, WCSS={kmeans.inertia_:.2f}")

#results
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.plot(k_range, wcss, 'bo-')
ax1.set_xlabel('Number of clusters (k)')
ax1.set_ylabel('WCSS')
#ax1.set_title('Elbow Method - BERT Only')

ax2.plot(k_range, silhouette_scores, 'ro-')
ax2.set_xlabel('Number of clusters (k)')
ax2.set_ylabel('Silhouette Score')
#ax2.set_title('Silhouette Analysis - BERT Only')

plt.tight_layout()
plt.show()

#checking specification NMI for key k values
print("\n=== Checking BERT-only Specification NMI ===")
for k in [2, 3, 4, 5, 6, 7, 8, 9, 10]:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=50)
    labels = kmeans.fit_predict(bert_spherical)

    spec_nmi = normalized_mutual_info_score(labels, df_cleaned['underspecified'].astype(int))
    print(f"k={k}: Specification NMI = {spec_nmi:.3f}")

In [None]:
#checking if BERT clusters align with repositories instead
from sklearn.metrics import normalized_mutual_info_score

print("=== BERT-only Repository Alignment ===")
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=50)
    labels = kmeans.fit_predict(bert_spherical)

    #repo labels
    repo_labels = pd.Categorical(df_cleaned['repo']).codes

    repo_nmi = normalized_mutual_info_score(labels, repo_labels)
    print(f"k={k}: Repository NMI = {repo_nmi:.3f}")

#check average pairwise similarity
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(bert_spherical)
avg_similarity = (similarities.sum() - len(similarities)) / (len(similarities) * (len(similarities) - 1))
print(f"\nAverage pairwise cosine similarity: {avg_similarity:.3f}")

In [None]:
#checking repos alignment for full k=6 clustering
repo_labels = pd.Categorical(df_cleaned['repo']).codes
repo_nmi_full = normalized_mutual_info_score(df_cleaned['unified_cluster'], repo_labels)
print(f"Full model (k=6) Repository NMI: {repo_nmi_full:.3f}")

#cross-tab
repo_dist = pd.crosstab(df_cleaned['repo'], df_cleaned['unified_cluster'], normalize='columns')
print("\nRepository distribution across clusters (column %):")
print(repo_dist.round(2))