# Loop 1 Analysis: Understanding the Gap to Target

**Current Status:**
- Best CV: 0.7584 (exp_000: DINOv2-base + LightGBM)
- Target: 0.79
- Gap: 0.0316

**Key Questions:**
1. What's causing the high variance in fold scores?
2. What features are most important?
3. How much improvement can we expect from larger models?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load training data
DATA_DIR = '/home/data'
train_df = pd.read_csv(f'{DATA_DIR}/train.csv')

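# Pivot from long format (one row per image/target pair) to one row per image.
# pivot_table averages duplicate (image, target) rows by default (aggfunc='mean').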
train_pivot = train_df.pivot_table(
    index=['image_path', 'Sampling_Date', 'State', 'Species', 'Pre_GSHH_NDVI', 'Height_Ave_cm'],
    columns='target_name',
    values='target'
).reset_index()

print(f'Total images: {len(train_pivot)}')
print('\nState distribution:')
print(train_pivot['State'].value_counts())
print('\nSpecies distribution:')
print(train_pivot['Species'].value_counts())

In [None]:
# Analyze target distributions and relationships
target_cols = ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g', 'GDM_g', 'Dry_Total_g']

print('Target statistics:')
print(train_pivot[target_cols].describe())

# Check biomass constraints
print('\nBiomass constraint verification:')
print('GDM = Dry_Green + Dry_Clover?')
constraint1 = np.abs(train_pivot['GDM_g'] - (train_pivot['Dry_Green_g'] + train_pivot['Dry_Clover_g']))
print(f'  Max deviation: {constraint1.max():.4f}')
print(f'  Mean deviation: {constraint1.mean():.4f}')

print('\nDry_Total = GDM + Dry_Dead?')
constraint2 = np.abs(train_pivot['Dry_Total_g'] - (train_pivot['GDM_g'] + train_pivot['Dry_Dead_g']))
print(f'  Max deviation: {constraint2.max():.4f}')
print(f'  Mean deviation: {constraint2.mean():.4f}')
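
If the two deviations above come out at (or near) zero, predictions can be post-processed so they satisfy the same identities. The following is a minimal sketch, not the method used in exp_000: the helper name `enforce_biomass_constraints` is hypothetical, the toy numbers are made up, and it assumes the three component predictions are trusted while the two aggregates are recomputed from them.

```python
import pandas as pd


def enforce_biomass_constraints(preds: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of preds where GDM = Green + Clover and Total = GDM + Dead.

    Assumption: component predictions are kept and aggregates are rebuilt.
    """
    out = preds.copy()
    components = ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g']
    # Biomass cannot be negative, so clip the components first.
    out[components] = out[components].clip(lower=0)
    # Rebuild the aggregate targets from the clipped components.
    out['GDM_g'] = out['Dry_Green_g'] + out['Dry_Clover_g']
    out['Dry_Total_g'] = out['GDM_g'] + out['Dry_Dead_g']
    return out


# Toy predictions (made-up numbers) that violate both identities.
raw_preds = pd.DataFrame({
    'Dry_Green_g': [10.0, 5.0],
    'Dry_Dead_g': [2.0, -1.0],
    'Dry_Clover_g': [1.0, 0.5],
    'GDM_g': [12.0, 5.0],
    'Dry_Total_g': [13.0, 4.0],
})
fixed_preds = enforce_biomass_constraints(raw_preds)
print(fixed_preds)
```

Because Dry_Total_g carries the highest weight, rebuilding it from components may not be the best choice. Going the other way (trusting the Dry_Total_g prediction and rescaling the components to match) is an equally plausible variant worth comparing in CV.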

In [None]:
# Analyze correlations between targets and tabular features
print('Correlations with Dry_Total_g (highest weight = 0.5):')
print(f'  NDVI: {train_pivot["Pre_GSHH_NDVI"].corr(train_pivot["Dry_Total_g"]):.3f}')
print(f'  Height: {train_pivot["Height_Ave_cm"].corr(train_pivot["Dry_Total_g"]):.3f}')

print('\nCorrelations with GDM_g (weight = 0.2):')
print(f'  NDVI: {train_pivot["Pre_GSHH_NDVI"].corr(train_pivot["GDM_g"]):.3f}')
print(f'  Height: {train_pivot["Height_Ave_cm"].corr(train_pivot["GDM_g"]):.3f}')
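
The four correlations above can also be read off a single table, which scales better once more features or targets are added. The sketch below runs on synthetic data (the `toy` frame and its coefficients are invented for illustration) and uses Spearman rank correlation, on the assumption that the biomass relationships may be monotonic but not linear. Swap `toy` for `train_pivot` to apply it to the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train_pivot; all values are made up.
rng = np.random.default_rng(0)
n = 200
height = rng.uniform(1, 30, size=n)
ndvi = rng.uniform(0.2, 0.9, size=n)
toy = pd.DataFrame({
    'Pre_GSHH_NDVI': ndvi,
    'Height_Ave_cm': height,
    'Dry_Total_g': 3.0 * height + rng.normal(0, 5, size=n),
    'GDM_g': 2.0 * height + 10 * ndvi + rng.normal(0, 5, size=n),
})

feature_cols = ['Pre_GSHH_NDVI', 'Height_Ave_cm']
toy_targets = ['Dry_Total_g', 'GDM_g']

# Rows = tabular features, columns = targets.
corr_table = toy.corr(method='spearman').loc[feature_cols, toy_targets]
print(corr_table.round(3))
```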

In [None]:
# Analyze the fold 5 issue: compare its validation split against the other folds.
# Simulate the fold split (assumes exp_000 used the same KFold settings and row order)
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(train_pivot)):
    val_data = train_pivot.iloc[val_idx]
    print(f'\nFold {fold + 1}:')
    print(f'  Size: {len(val_idx)}')
    print(f'  States: {val_data["State"].value_counts().to_dict()}')
    print(f'  Mean Dry_Total: {val_data["Dry_Total_g"].mean():.2f}')
    print(f'  Std Dry_Total: {val_data["Dry_Total_g"].std():.2f}')
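
If the per-fold state counts printed above differ noticeably, a stratified split is one way to even them out. This is a minimal sketch on toy data (state labels `A`, `B`, `C` and their counts are invented). It uses scikit-learn's `StratifiedKFold` with `State` as the stratification label, and it balances only the state mix, not the target distribution.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy stand-in for train_pivot; state labels and counts are made up.
toy = pd.DataFrame({
    'State': ['A'] * 50 + ['B'] * 30 + ['C'] * 20,
    'Dry_Total_g': np.linspace(1.0, 100.0, 100),
})

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_state_counts = []
# The second argument to split() is the label to stratify on.
for fold, (train_idx, val_idx) in enumerate(skf.split(toy, toy['State'])):
    counts = toy.iloc[val_idx]['State'].value_counts().sort_index().to_dict()
    fold_state_counts.append(counts)
    print(f'Fold {fold + 1}: {counts}')
```

On the real data, pass `train_pivot` and `train_pivot['State']` to `split()`. Fold scores from a stratified split are not directly comparable with the 0.7584 baseline unless the baseline is re-run on the same folds.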

In [None]:
# Summary of key findings
print('='*60)
print('KEY FINDINGS FOR STRATEGY')
print('='*60)
print()
print('1. BIOMASS CONSTRAINTS ARE EXACT in training data')
print('   - GDM = Dry_Green + Dry_Clover (exact)')
print('   - Dry_Total = GDM + Dry_Dead (exact)')
print('   -> Post-processing predictions to satisfy both identities keeps them internally consistent')
print()
print('2. TABULAR FEATURES have moderate correlation with targets')
print('   - NDVI and Height are useful but not sufficient alone')
print('   -> Image features are critical')
print()
print('3. DATA HETEROGENEITY exists across folds')
print('   - Different state distributions per fold')
print('   -> Consider stratified CV by State')
print()
print('4. TOP KERNELS USE:')
print('   - DINOv2-giant (1536 dims) vs our base (768 dims)')
print('   - Patch-based features (last_hidden_state[:,1:,:])')
print('   - Multiple model ensemble (GB, HGB, CatBoost, LightGBM)')
print('   - TTA with flips, rotations, blur')
print('   - Post-processing for biomass constraints')
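
The `last_hidden_state[:,1:,:]` slice in finding 4 drops the CLS token at index 0 and keeps the per-patch tokens. How the top kernels pool those tokens is not recorded here, so the sketch below assumes the simplest option: mean-pool the patch tokens and concatenate the result with the CLS token. A random array stands in for the model output, and its shape (257 tokens, 768 dims) is an assumption that corresponds to DINOv2-base on a 224x224 input with 14-pixel patches. `pool_tokens` is a hypothetical helper name.

```python
import numpy as np


def pool_tokens(last_hidden_state: np.ndarray) -> np.ndarray:
    """Concatenate the CLS token with the mean of the patch tokens.

    Expects shape (batch, 1 + num_patches, hidden_dim).
    """
    cls_token = last_hidden_state[:, 0, :]       # (batch, hidden_dim)
    patch_tokens = last_hidden_state[:, 1:, :]   # (batch, num_patches, hidden_dim)
    patch_mean = patch_tokens.mean(axis=1)       # (batch, hidden_dim)
    return np.concatenate([cls_token, patch_mean], axis=1)


# Random stand-in for a model's last_hidden_state (not real embeddings).
rng = np.random.default_rng(0)
fake_hidden = rng.normal(size=(4, 257, 768))
features = pool_tokens(fake_hidden)
print(features.shape)
```

With a base-size model this doubles the feature width to 1536 columns per image, which then feed the gradient-boosting models in the same way as the current CLS-only features.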