# Evolver Loop 2 Analysis

## Situation Assessment

**Current Status:**
- Best CV score: 70.676102 (from ensemble.csv)
- Best LB score: N/A (first submission failed due to overlaps)
- Target: 68.919154
- Gap: 1.756948 points (~2.5% improvement needed)

**Key Observations:**
1. The pre-optimized solution is at a strong local optimum
2. Multiple C++ optimizers (bbox3, tree_packer, backward propagation, fractional translation) found NO improvement
3. All available pre-optimized solutions have the same best configurations for each N
4. The first submission failed because best_ensemble.csv has overlaps detected by Kaggle

**Critical Question:** How do we escape this local optimum to reach the target?

In [None]:
import pandas as pd
import numpy as np
import os

# Load the current best solution
base_path = '/home/nonroot/snapshots/santa-2025/21116303805/code/preoptimized'
df = pd.read_csv(f'{base_path}/ensemble.csv')

# Parse values
def parse_s_value(s):
    """Convert a coordinate to float, stripping the leading 's' from string values."""
    if isinstance(s, str) and s.startswith('s'):
        return float(s[1:])
    return float(s)

df['x_val'] = df['x'].apply(parse_s_value)
df['y_val'] = df['y'].apply(parse_s_value)
df['deg_val'] = df['deg'].apply(parse_s_value)

print(f"Loaded {len(df)} rows")
print(df.head())

In [None]:
# Analyze score contribution by N
from shapely.geometry import Polygon
from shapely import affinity

TX = [0, 0.125, 0.0625, 0.2, 0.1, 0.35, 0.075, 0.075, -0.075, -0.075, -0.35, -0.1, -0.2, -0.0625, -0.125]
TY = [0.8, 0.5, 0.5, 0.25, 0.25, 0, 0, -0.2, -0.2, 0, 0, 0.25, 0.25, 0.5, 0.5]
TREE_VERTICES = list(zip(TX, TY))

def create_tree_polygon(x, y, deg):
    poly = Polygon(TREE_VERTICES)
    poly = affinity.rotate(poly, deg, origin=(0, 0))
    poly = affinity.translate(poly, x, y)
    return poly

def get_bounding_box_side(polygons):
    """Side of the smallest axis-aligned square enclosing all polygons."""
    if not polygons:
        return 0
    # Each poly.bounds is (minx, miny, maxx, maxy)
    minxs, minys, maxxs, maxys = zip(*(poly.bounds for poly in polygons))
    return max(max(maxxs) - min(minxs), max(maxys) - min(minys))

# Calculate score per N
scores_per_n = []
for n in range(1, 201):
    prefix = f'{n:03d}_'
    group = df[df['id'].str.startswith(prefix)]
    if len(group) == 0:
        continue
    
    polygons = []
    for _, row in group.iterrows():
        poly = create_tree_polygon(row['x_val'], row['y_val'], row['deg_val'])
        polygons.append(poly)
    
    side = get_bounding_box_side(polygons)
    score = side**2 / n
    scores_per_n.append({'n': n, 'side': side, 'score': score, 'trees': len(group)})

scores_df = pd.DataFrame(scores_per_n)
print(f"Total score: {scores_df['score'].sum():.6f}")
print(f"\nTop 10 score contributors:")
print(scores_df.nlargest(10, 'score')[['n', 'side', 'score']])

In [None]:
# Analyze the score distribution
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Score by N
axes[0, 0].bar(scores_df['n'], scores_df['score'], alpha=0.7)
axes[0, 0].set_xlabel('N')
axes[0, 0].set_ylabel('Score contribution')
axes[0, 0].set_title('Score contribution by N')

# Side length by N
axes[0, 1].scatter(scores_df['n'], scores_df['side'], alpha=0.5, s=10)
axes[0, 1].set_xlabel('N')
axes[0, 1].set_ylabel('Bounding box side')
axes[0, 1].set_title('Bounding box side by N')

# Cumulative score
scores_df_sorted = scores_df.sort_values('score', ascending=False)
scores_df_sorted['cumsum'] = scores_df_sorted['score'].cumsum()
axes[1, 0].plot(range(1, len(scores_df_sorted)+1), scores_df_sorted['cumsum'])
axes[1, 0].axhline(y=68.919154, color='r', linestyle='--', label='Target')
axes[1, 0].set_xlabel('Number of N values (sorted by score)')
axes[1, 0].set_ylabel('Cumulative score')
axes[1, 0].set_title('Cumulative score (sorted by contribution)')
axes[1, 0].legend()

# Packing efficiency vs. the area lower bound (side >= sqrt(n * tree_area))
tree_area = Polygon(TREE_VERTICES).area
scores_df['theoretical_min_side'] = np.sqrt(scores_df['n'] * tree_area)
scores_df['efficiency'] = scores_df['theoretical_min_side'] / scores_df['side']
axes[1, 1].scatter(scores_df['n'], scores_df['efficiency'], alpha=0.5, s=10)
axes[1, 1].set_xlabel('N')
axes[1, 1].set_ylabel('Packing efficiency')
axes[1, 1].set_title('Packing efficiency by N (higher is better)')

plt.tight_layout()
plt.savefig('/home/code/exploration/score_analysis.png', dpi=100)
plt.show()

print(f"\nTree area: {tree_area:.6f}")
print(f"Average efficiency: {scores_df['efficiency'].mean():.4f}")
print(f"\nLowest efficiency N values (most room for improvement):")
print(scores_df.nsmallest(10, 'efficiency')[['n', 'side', 'score', 'efficiency']])

In [None]:
# Calculate how much improvement is needed
target = 68.919154
current = scores_df['score'].sum()
gap = current - target

print(f"Current score: {current:.6f}")
print(f"Target score: {target:.6f}")
print(f"Gap to close: {gap:.6f}")
print(f"Percentage improvement needed: {100 * gap / current:.2f}%")

# Sanity check: the same relative reduction (gap/current) on every N reaches the target exactly
print(f"\nIf every N's score improved by {100 * gap / current:.2f}%:")
print(f"  New score would be: {current * (1 - gap/current):.6f}")

# What if we could match the best efficiency for all N?
best_efficiency = scores_df['efficiency'].max()
print(f"\nBest efficiency achieved: {best_efficiency:.4f} at N={scores_df.loc[scores_df['efficiency'].idxmax(), 'n']}")
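# theoretical_min_side**2 / n equals tree_area, so this is len(scores_df) * tree_area / best_efficiency**2
uniform_best_score = (scores_df['theoretical_min_side']**2 / scores_df['n'] / best_efficiency**2).sum()
print(f"If all N had this efficiency, score would be: {uniform_best_score:.6f}")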

In [None]:
# Analyze which N values have the most potential for improvement
# based on their current efficiency vs best efficiency

scores_df['potential_improvement'] = scores_df['score'] * (1 - (scores_df['efficiency'] / best_efficiency)**2)
scores_df['potential_new_score'] = scores_df['score'] - scores_df['potential_improvement']

print("N values with most improvement potential (if they matched best efficiency):")
print(scores_df.nlargest(15, 'potential_improvement')[['n', 'score', 'efficiency', 'potential_improvement']])

print(f"\nTotal potential improvement: {scores_df['potential_improvement'].sum():.6f}")
print(f"Potential new score: {scores_df['potential_new_score'].sum():.6f}")

## Key Insights

1. **The gap is ~1.76 points** - Closing it requires a ~2.5% reduction of the total score

2. **Small N values have the largest per-N contributions** - N=1,2,3 alone contribute ~1.5 points

3. **The pre-optimized solution is at a strong local optimum** - Multiple optimizers found no improvement

4. **Different N values have different packing efficiencies** - Some N values are packed more efficiently than others
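
One way to make insight 4 concrete is to rewrite the per-N score in terms of the efficiency defined in the analysis cell. The symbols below ($L_n$ for the bounding-square side, $A$ for the tree polygon's area, $e_n$ for the efficiency, $e^{*}$ for the best efficiency observed) are notation introduced here for illustration and do not appear in the original notes.

```latex
\begin{aligned}
s_n &= \frac{L_n^2}{n}, \qquad e_n = \frac{\sqrt{nA}}{L_n}
\;\;\Longrightarrow\;\; s_n = \frac{A}{e_n^2}, \\
S &= \sum_{n=1}^{200} s_n = A \sum_{n=1}^{200} \frac{1}{e_n^2} \;\ge\; 200A, \\
\Delta_n &= s_n - \frac{A}{(e^{*})^2} = s_n\left(1 - \frac{e_n^2}{(e^{*})^2}\right).
\end{aligned}
```

The last line is the `potential_improvement` column. The bound $S \ge 200A$ follows from $e_n \le 1$, since non-overlapping trees cannot cover more area than their bounding square. With $A = 0.245625$ (shoelace formula on `TREE_VERTICES`) it evaluates to 49.125, which is an area bound only and is not expected to be attainable.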

## Strategic Options

### Option 1: Submit the overlap-free baseline first
- Use ensemble.csv, which is verified overlap-free
- This establishes a baseline LB score
- **Priority: HIGHEST** - We need a valid submission!
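
The notes do not say how ensemble.csv was verified, so the following is only a minimal sketch of a local pre-submission check. It assumes shapely is available, as in the analysis cells. `make_tree`, `find_overlaps` and the `area_tol` threshold are hypothetical names and values, and the official checker may apply a different rule (for example, a different treatment of touching edges).

```python
from itertools import combinations

from shapely import affinity
from shapely.geometry import Polygon

# Tree outline copied from the analysis cell
TX = [0, 0.125, 0.0625, 0.2, 0.1, 0.35, 0.075, 0.075,
      -0.075, -0.075, -0.35, -0.1, -0.2, -0.0625, -0.125]
TY = [0.8, 0.5, 0.5, 0.25, 0.25, 0, 0, -0.2,
      -0.2, 0, 0, 0.25, 0.25, 0.5, 0.5]
TREE_VERTICES = list(zip(TX, TY))


def make_tree(x, y, deg):
    """Rotate the tree about its origin, then translate it to (x, y)."""
    poly = affinity.rotate(Polygon(TREE_VERTICES), deg, origin=(0, 0))
    return affinity.translate(poly, x, y)


def find_overlaps(polygons, area_tol=1e-12):
    """Return index pairs whose intersection area exceeds area_tol.

    area_tol is an assumed tolerance, not the official checker's threshold.
    """
    bad = []
    for (i, a), (j, b) in combinations(enumerate(polygons), 2):
        if a.intersects(b) and a.intersection(b).area > area_tol:
            bad.append((i, j))
    return bad


apart = [make_tree(0, 0, 0), make_tree(2, 0, 0)]
stacked = [make_tree(0, 0, 0), make_tree(0.1, 0, 0)]
print(find_overlaps(apart))    # []
print(find_overlaps(stacked))  # [(0, 1)]
```

The check is quadratic in the number of trees per group, which should be acceptable for groups of at most 200 trees. Running it on each N group before submitting would catch the kind of failure seen with best_ensemble.csv, provided the tolerance is at least as strict as the official one.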

### Option 2: Focus on low-efficiency N values
- Identify N values with worst packing efficiency
- Run longer optimization specifically on these N values
- Use different starting configurations
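
Picking the targets for Option 2 could look like the sketch below. It reuses the notebook's efficiency definition in plain Python so that it runs on its own. `rank_targets` and the `side_by_n` mapping are hypothetical names, and the sides in the usage lines are made-up illustrative values, not results from ensemble.csv.

```python
import math

# Area of the tree polygon defined by TX/TY (shoelace formula)
TREE_AREA = 0.245625


def rank_targets(side_by_n, k=10):
    """Return up to k (n, efficiency, score) tuples, worst efficiency first.

    side_by_n maps N to the current bounding-square side for that N.
    """
    rows = []
    for n, side in side_by_n.items():
        efficiency = math.sqrt(n * TREE_AREA) / side  # same definition as the notebook
        rows.append((n, efficiency, side * side / n))
    rows.sort(key=lambda row: row[1])  # lowest efficiency first
    return rows[:k]


# Illustrative sides only
for n, eff, score in rank_targets({1: 0.9, 2: 1.0, 3: 1.2}, k=2):
    print(f"N={n}: efficiency={eff:.4f}, score={score:.4f}")
```

Ranking by `potential_improvement` instead of raw efficiency would weight each N by how many points it could actually return, which may be the more useful ordering when optimizer time is limited.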

### Option 3: Try fundamentally different approaches
- Genetic algorithms with population diversity
- Different initial placements (not from pre-optimized)
- Constructive heuristics from scratch
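
As a rough illustration of a starting configuration that does not come from the pre-optimized files, the sketch below places unrotated trees on a rectangular grid. This is a deliberately naive baseline, not a method from the notes. `grid_layout` and the `GAP` clearance are hypothetical, and the tree extent is read off `TX`/`TY`.

```python
import math

# Extent of the unrotated tree: x in [-0.35, 0.35], y in [-0.2, 0.8]
TREE_W, TREE_H = 0.7, 1.0
GAP = 0.01  # assumed clearance so neighbouring bounding boxes never touch


def grid_layout(n):
    """Place n unrotated trees on the grid shape that minimises the bounding square.

    Returns (placements, side), where placements is a list of (x, y, deg) tuples.
    """
    best = None
    for cols in range(1, n + 1):
        rows = math.ceil(n / cols)
        width = (cols - 1) * (TREE_W + GAP) + TREE_W
        height = (rows - 1) * (TREE_H + GAP) + TREE_H
        side = max(width, height)
        if best is None or side < best[0]:
            best = (side, cols)
    side, cols = best
    placements = [((i % cols) * (TREE_W + GAP), (i // cols) * (TREE_H + GAP), 0.0)
                  for i in range(n)]
    return placements, side


placements, side = grid_layout(4)
print(len(placements), side * side / 4)
```

Because each tree's bounding box is disjoint from its neighbours', the layout is overlap-free by construction. For N=4 it scores roughly 1.01, well above the ~0.35 average per-N contribution implied by a total of 70.68 over 200 values, so it is useful only as a diverse seed for a later optimizer.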

### Option 4: Longer optimization runs
- The bbox3 runner kernel runs for 3+ hours
- Our runs lasted only minutes
- Longer runs might find improvements that short runs miss
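
A time-budgeted harness for longer runs could be sketched as follows. It does not call bbox3 or any other real optimizer. `run_with_budget`, `score_fn` and `propose_fn` are hypothetical stand-ins for a real scorer and move generator, and the toy problem in the usage lines only minimises x squared.

```python
import random
import time


def run_with_budget(initial, score_fn, propose_fn, budget_seconds, seed=0):
    """Propose candidates until the time budget expires; return the best seen.

    score_fn(state) returns a score where lower is better.
    propose_fn(state, rng) returns a new candidate state.
    """
    rng = random.Random(seed)
    best, best_score = initial, score_fn(initial)
    deadline = time.monotonic() + budget_seconds
    steps = 0
    while time.monotonic() < deadline:
        candidate = propose_fn(best, rng)
        candidate_score = score_fn(candidate)
        if candidate_score < best_score:  # accept improvements only
            best, best_score = candidate, candidate_score
        steps += 1
    return best, best_score, steps


# Toy usage: minimise x**2 starting from x = 5.0
best, best_score, steps = run_with_budget(
    5.0,
    score_fn=lambda x: x * x,
    propose_fn=lambda x, rng: x + rng.gauss(0, 0.5),
    budget_seconds=0.05,
)
print(f"{steps} steps, best score {best_score:.6f}")
```

Accepting only improvements makes this a plain hill climber, so a longer budget alone would not escape the local optimum. In practice the proposal step would need larger perturbations or restarts, as in Option 3.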

In [None]:
# Check what the submission file currently contains
submission_path = '/home/submission/submission.csv'
if os.path.exists(submission_path):
    sub_df = pd.read_csv(submission_path)
    print(f"Submission file has {len(sub_df)} rows")
    print(sub_df.head())
    
    # Verify it matches ensemble.csv on every submitted column, not just id and x
    ensemble_df = pd.read_csv(f'{base_path}/ensemble.csv')
    cols = ['id', 'x', 'y', 'deg']
    if sub_df[cols].equals(ensemble_df[cols]):
        print("\n✓ Submission matches ensemble.csv")
    else:
        print("\n⚠ Submission differs from ensemble.csv")
else:
    print("No submission file found")