# Loop 18 Strategic Analysis

## Current Situation
- **Best LB**: 70.3535 (exp_016)
- **Target**: 68.877877
- **Gap**: 1.48 points (2.1%)

## Key Insight from Evaluator
exp_017 confirmed we've EXHAUSTED the ensemble approach with current data:
- 0 improvements found over exp_016
- 17,543 improvements rejected as too small (< 0.001)
- 2,410 rejected for overlaps
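
The two rejection rules above can be sketched as a small filter. This is only an illustration: `accept_candidate`, its argument names, and the order of the checks are assumptions, not the code exp_017 ran. It assumes lower scores are better, as the target sits below the current LB.

```python
def accept_candidate(current_score, candidate_score, has_overlap, min_gain=0.001):
    """Decide whether a candidate per-N solution should replace the current one.

    Hypothetical helper: the names and the check order are assumptions.
    Lower score is better, so gain = current - candidate.
    """
    if has_overlap:
        return False, "overlap"
    gain = current_score - candidate_score
    if gain <= 0:
        return False, "no_gain"
    if gain < min_gain:
        return False, "too_small"
    return True, "accepted"


# A gain of 0.0005 is positive but below the 0.001 threshold.
print(accept_candidate(1.0, 0.9995, has_overlap=False))
```

Counting the returned reasons over all candidates would reproduce a breakdown like the one reported above.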

## What Top Kernels Do Differently
1. **3-hour bbox3 runs** with n=1000-2000, r=30-90
2. **fix_direction()** - rotates entire configuration to minimize bbox
3. **17+ external data sources** (we have ~10)

In [None]:
# Let's analyze what we're missing
import pandas as pd
import numpy as np
import os
import glob

# Count our external sources
external_files = glob.glob('/home/code/external_data/**/*.csv', recursive=True)
print(f"External CSV files: {len(external_files)}")
for f in external_files:
    print(f"  {f.replace('/home/code/external_data/', '')}")

# Count unique sources (dedupe by content hash)
import hashlib
unique_hashes = set()
for f in external_files:
    try:
        with open(f, 'rb') as file:
            h = hashlib.md5(file.read()).hexdigest()
            unique_hashes.add(h)
    except OSError:
        # Skip unreadable files rather than silencing every exception
        pass
print(f"\nUnique external sources: {len(unique_hashes)}")
print("Top kernels use: 17-19 sources")

In [None]:
# Analyze per-N scores to find where we're weakest
import sys
sys.path.insert(0, '/home/code')
from code.tree_geometry import calculate_score
from code.utils import parse_submission

# Load our best submission
df = pd.read_csv('/home/code/experiments/016_mega_ensemble_external/submission.csv')
configs = parse_submission(df)

# Calculate per-N scores
per_n_scores = {}
for n in range(1, 201):
    per_n_scores[n] = calculate_score(configs[n])

# Find top contributors to total score
sorted_by_score = sorted(per_n_scores.items(), key=lambda x: x[1], reverse=True)
print("Top 20 N values by score contribution:")
for n, score in sorted_by_score[:20]:
    print(f"  N={n}: {score:.6f}")

# Compute the totals once instead of re-summing inside each f-string
total_score = sum(per_n_scores.values())
top20_score = sum(score for _, score in sorted_by_score[:20])
print(f"\nTotal score: {total_score:.6f}")
print(f"Top 20 contribute: {top20_score:.6f}")
print(f"Percentage: {top20_score / total_score * 100:.1f}%")

In [None]:
# Analyze the 17,543 rejected improvements
# These are improvements < 0.001 that we can't safely use
# But what if we could use them for SOME N values?

# Let's see which N values have NEVER caused Kaggle failures
# From session_state: failures were in N=2, 89, 123
failed_n_values = {2, 89, 123}  # From exp_000, exp_009, exp_013

print("N values that have caused Kaggle failures:")
print(failed_n_values)

print("\nN values that have NEVER failed (potential safe zones):")
safe_n = set(range(1, 201)) - failed_n_values
print(f"Count: {len(safe_n)} out of 200")

In [None]:
# Key insight: fix_direction() can improve scores by rotating the ENTIRE configuration
# This is different from rotating individual trees
# Let's implement this!

from scipy.optimize import minimize_scalar
from scipy.spatial import ConvexHull

def calculate_bbox_side_at_angle(angle_deg, points):
    """Calculate bbox side length when configuration is rotated by angle_deg"""
    angle_rad = np.radians(angle_deg)
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot_matrix_T = np.array([[c, s], [-s, c]])
    rotated_points = points.dot(rot_matrix_T)
    min_xy = np.min(rotated_points, axis=0)
    max_xy = np.max(rotated_points, axis=0)
    return max(max_xy[0] - min_xy[0], max_xy[1] - min_xy[1])

def optimize_rotation(trees):
    """Find optimal rotation angle for entire configuration"""
    from code.tree_geometry import get_tree_vertices_numba
    
    # Get all vertices
    all_points = []
    for x, y, angle in trees:
        rx, ry = get_tree_vertices_numba(x, y, angle)
        for xi, yi in zip(rx, ry):
            all_points.append([xi, yi])
    points_np = np.array(all_points)
    
    # Use convex hull for efficiency
    hull_points = points_np[ConvexHull(points_np).vertices]
    
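    # Search (0, 90) degrees for a better rotation. The bbox side repeats every
    # 90 degrees and is not unimodal, so the bounded search may return a local minimum.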
    initial_side = calculate_bbox_side_at_angle(0, hull_points)
    
    res = minimize_scalar(
        lambda a: calculate_bbox_side_at_angle(a, hull_points),
        bounds=(0.001, 89.999),
        method="bounded",
    )
    
    found_angle = float(res.x)
    found_side = float(res.fun)
    
    improvement = initial_side - found_side
    return initial_side, found_side, found_angle, improvement

# Test on a few N values
print("Testing fix_direction optimization:")
for n in [10, 50, 100, 150, 200]:
    trees = configs[n]
    initial, optimized, angle, improvement = optimize_rotation(trees)
    if improvement > 0.0001:
        print(f"  N={n}: {initial:.6f} -> {optimized:.6f} (angle={angle:.2f}°, improvement={improvement:.6f})")
    else:
        print(f"  N={n}: No improvement from rotation")

## Strategic Options

### Option 1: Run bbox3 for 3 hours (like top kernels)
- Use phased approach: Phase A (2min), Phase B (10min), Phase C (20min)
- Parameters: n=1000-2000, r=30-90
- Expected gain: Unknown, but top kernels use this
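
A time-boxed phase loop along these lines could drive such a run. This is a sketch under assumptions: `run_phases` and the `step` callable are hypothetical stand-ins for one optimizer invocation and do not reflect the actual bbox3 interface. The phase budgets come from the list above; how they repeat across a 3-hour run is not specified here.

```python
import time

# Phase names and budgets from the list above, in seconds.
PHASES = [("A", 2 * 60), ("B", 10 * 60), ("C", 20 * 60)]


def run_phases(phases, step, clock=time.monotonic):
    """Call step(phase_name) repeatedly until each phase's budget is spent.

    `step` is a hypothetical callable standing in for one optimizer
    invocation. Returns the number of steps executed per phase.
    """
    counts = {}
    for name, budget_s in phases:
        deadline = clock() + budget_s
        counts[name] = 0
        while clock() < deadline:
            step(name)
            counts[name] += 1
    return counts
```

The `clock` parameter is injectable so the loop can be tested without waiting for real time to pass.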

### Option 2: Implement fix_direction for all N
- Rotate entire configuration to minimize bbox
- This is a POST-PROCESSING step that can improve any solution
- Expected gain: 0.01-0.1 points
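
Applying a found angle means rotating every tree center about the origin and adding the same angle to each tree's own rotation. A minimal sketch follows, assuming each tree is an `(x, y, angle_deg)` tuple with angles measured counter-clockwise in degrees. `rotate_configuration` is a hypothetical helper, and the angle convention should be checked against `get_tree_vertices_numba` before use.

```python
import numpy as np


def rotate_configuration(trees, angle_deg):
    """Rigidly rotate a whole configuration about the origin.

    Assumes trees are (x, y, angle_deg) tuples with counter-clockwise
    angles in degrees; this convention is an assumption.
    """
    theta = np.radians(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    rotated = []
    for x, y, tree_angle in trees:
        new_x = c * x - s * y  # counter-clockwise rotation of the center
        new_y = s * x + c * y
        # Each tree must turn with the configuration to keep the shape rigid.
        rotated.append((new_x, new_y, (tree_angle + angle_deg) % 360.0))
    return rotated
```

Because the transform is rigid, distances between tree centers are unchanged, which is why it works as a post-processing step.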

### Option 3: Lower threshold for "safe" N values
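- Only N=2, 89, and 123 have caused Kaggle failures so far (exp_000, exp_009, exp_013)
- Could lower the acceptance threshold from 0.001 to 0.0001 for the other N values
- Risk: N values that have not failed yet might still fail at the lower threshold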
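
A per-N threshold lookup could look like the sketch below. The function name and the `strict`/`relaxed` parameters are illustrative assumptions. The failed set is the one recorded in this notebook.

```python
# N values that have caused Kaggle failures so far (from this notebook).
FAILED_N = frozenset({2, 89, 123})


def threshold_for(n, failed_n=FAILED_N, strict=0.001, relaxed=0.0001):
    """Return the minimum improvement required to accept a candidate for N.

    Hypothetical helper: keeps the strict threshold for N values that
    have failed before and relaxes it everywhere else.
    """
    return strict if n in failed_n else relaxed


relaxed_count = sum(1 for n in range(1, 201) if threshold_for(n) == 0.0001)
print(f"N values using the relaxed threshold: {relaxed_count}")
```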

### Option 4: Generate NEW solutions from scratch
- Current solutions all come from the same sources
- Need fundamentally different configurations
- Requires implementing new algorithms (NFP, genetic, etc.)

## Recommendation
**Option 2 (fix_direction) is the safest and most promising.**
- It's a pure post-processing step
- Top kernels use it
- Can be applied to our existing best solution
- Should not introduce overlap failures: a rigid rotation preserves relative positions, provided each tree's own angle is rotated by the same amount