# Loop 1 Analysis: Understanding the Gap to Target

**Current Status:**
- Best CV: 0.7584 (exp_000: DINOv2-base + LightGBM)
- Target: 0.79
- Gap: 0.0316

**Key Questions:**
1. What's causing the high variance in fold scores?
2. What features are most important?
3. How much improvement can we expect from larger models?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load training data
DATA_DIR = '/home/data'
train_df = pd.read_csv(f'{DATA_DIR}/train.csv')

# Pivot to image level
train_pivot = train_df.pivot_table(
    index=['image_path', 'Sampling_Date', 'State', 'Species', 'Pre_GSHH_NDVI', 'Height_Ave_cm'],
    columns='target_name',
    values='target'
).reset_index()

print(f'Total images: {len(train_pivot)}')
print('\nState distribution:')
print(train_pivot['State'].value_counts())
print('\nSpecies distribution:')
print(train_pivot['Species'].value_counts())

Total images: 357

State distribution:
State
Tas    138
Vic    112
NSW     75
WA      32
Name: count, dtype: int64

Species distribution:
Species
Ryegrass_Clover                                                98
Ryegrass                                                       62
Phalaris_Clover                                                42
Clover                                                         41
Fescue                                                         28
Lucerne                                                        22
Phalaris_BarleyGrass_SilverGrass_SpearGrass_Clover_Capeweed    11
Fescue_CrumbWeed                                               10
WhiteClover                                                    10
Phalaris                                                        8
Phalaris_Ryegrass_Clover                                        8
Phalaris_Clover_Ryegrass_Barleygrass_Bromegrass                 7
SubcloverLosa                                                 

In [2]:
# Analyze target distributions and relationships
target_cols = ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g', 'GDM_g', 'Dry_Total_g']

print('Target statistics:')
print(train_pivot[target_cols].describe())

# Check biomass constraints
print('\nBiomass constraint verification:')
print('GDM = Dry_Green + Dry_Clover?')
constraint1 = np.abs(train_pivot['GDM_g'] - (train_pivot['Dry_Green_g'] + train_pivot['Dry_Clover_g']))
print(f'  Max deviation: {constraint1.max():.4f}')
print(f'  Mean deviation: {constraint1.mean():.4f}')

print('\nDry_Total = GDM + Dry_Dead?')
constraint2 = np.abs(train_pivot['Dry_Total_g'] - (train_pivot['GDM_g'] + train_pivot['Dry_Dead_g']))
print(f'  Max deviation: {constraint2.max():.4f}')
print(f'  Mean deviation: {constraint2.mean():.4f}')

Target statistics:
target_name  Dry_Green_g  Dry_Dead_g  Dry_Clover_g       GDM_g  Dry_Total_g
count         357.000000  357.000000    357.000000  357.000000   357.000000
mean           26.624722   12.044548      6.649692   33.274414    45.318097
std            25.401232   12.402007     12.117761   24.935822    27.984015
min             0.000000    0.000000      0.000000    1.040000     1.040000
25%             8.800000    3.200000      0.000000   16.026100    25.271500
50%            20.800000    7.980900      1.423500   27.108200    40.300000
75%            35.083400   17.637800      7.242900   43.675700    57.880000
max           157.983600   83.840700     71.786500  157.983600   185.700000

Biomass constraint verification:
GDM = Dry_Green + Dry_Clover?
  Max deviation: 0.0001
  Mean deviation: 0.0000

Dry_Total = GDM + Dry_Dead?
  Max deviation: 0.3088
  Mean deviation: 0.0009
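
Since both identities hold almost exactly in the training labels, one option is to rebuild the two aggregate targets from the three component predictions. The following is a minimal sketch under stated assumptions, not the method used in exp_000: the function name `enforce_biomass_constraints` and the column order (taken from `target_cols`) are hypothetical, and the clipping at zero is an extra assumption.

```python
import numpy as np

# Column order is an assumption for this sketch; it mirrors target_cols above.
TARGET_COLS = ['Dry_Green_g', 'Dry_Dead_g', 'Dry_Clover_g', 'GDM_g', 'Dry_Total_g']

def enforce_biomass_constraints(preds):
    """Rebuild GDM and Dry_Total from the three component predictions.

    preds: array of shape (n_samples, 5) in TARGET_COLS order.
    Returns a new array; the input is not modified.
    """
    out = np.asarray(preds, dtype=float).copy()
    # Biomass cannot be negative, so clip the three components first.
    out[:, :3] = np.clip(out[:, :3], 0.0, None)
    green, dead, clover = out[:, 0], out[:, 1], out[:, 2]
    out[:, 3] = green + clover      # GDM = Dry_Green + Dry_Clover
    out[:, 4] = out[:, 3] + dead    # Dry_Total = GDM + Dry_Dead
    return out

# Two made-up prediction rows that violate the constraints.
raw = np.array([
    [20.0, 8.0, 3.0, 25.0, 30.0],
    [-1.0, 5.0, 0.5, 2.0, 9.0],
])
fixed = enforce_biomass_constraints(raw)
print(fixed)
```

This variant discards the direct GDM and Dry_Total predictions, which may not be the best trade-off given that those two targets carry the largest weights. Keeping the aggregate predictions and rescaling the components to match them is an equally plausible alternative.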


In [3]:
# Analyze correlations between targets and tabular features
print('Correlations with Dry_Total_g (highest weight = 0.5):')
print(f'  NDVI: {train_pivot["Pre_GSHH_NDVI"].corr(train_pivot["Dry_Total_g"]):.3f}')
print(f'  Height: {train_pivot["Height_Ave_cm"].corr(train_pivot["Dry_Total_g"]):.3f}')

print('\nCorrelations with GDM_g (weight = 0.2):')
print(f'  NDVI: {train_pivot["Pre_GSHH_NDVI"].corr(train_pivot["GDM_g"]):.3f}')
print(f'  Height: {train_pivot["Height_Ave_cm"].corr(train_pivot["GDM_g"]):.3f}')

Correlations with Dry_Total_g (highest weight = 0.5):
  NDVI: 0.361
  Height: 0.497

Correlations with GDM_g (weight = 0.2):
  NDVI: 0.467
  Height: 0.583
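
The four correlations above can also be produced as one feature-by-target table, which scales more easily to all five targets. This is a small sketch on synthetic data: the `demo` frame and its coefficients are made up for illustration and only stand in for `train_pivot`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for train_pivot; values are illustrative only.
height = rng.uniform(1, 30, n)
ndvi = rng.uniform(0.2, 0.9, n)
demo = pd.DataFrame({
    'Pre_GSHH_NDVI': ndvi,
    'Height_Ave_cm': height,
    'GDM_g': 2.0 * height + 20 * ndvi + rng.normal(0, 10, n),
    'Dry_Total_g': 2.5 * height + 10 * ndvi + rng.normal(0, 15, n),
})

feature_cols = ['Pre_GSHH_NDVI', 'Height_Ave_cm']
target_cols = ['GDM_g', 'Dry_Total_g']
# Full Pearson matrix, then keep rows = features and columns = targets.
corr_table = demo[feature_cols + target_cols].corr().loc[feature_cols, target_cols]
print(corr_table.round(3))
```

On the real data, replacing `demo` with `train_pivot` and extending `target_cols` to the full list should give the same numbers as the individual `.corr()` calls.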


In [4]:
# Analyze fold 5 issue - what makes it different?
# Simulate the fold split
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(train_pivot)):
    val_data = train_pivot.iloc[val_idx]
    print(f'\nFold {fold + 1}:')
    print(f'  Size: {len(val_idx)}')
    print(f'  States: {val_data["State"].value_counts().to_dict()}')
    print(f'  Mean Dry_Total: {val_data["Dry_Total_g"].mean():.2f}')
    print(f'  Std Dry_Total: {val_data["Dry_Total_g"].std():.2f}')


Fold 1:
  Size: 72
  States: {'Tas': 32, 'Vic': 17, 'NSW': 14, 'WA': 9}
  Mean Dry_Total: 43.61
  Std Dry_Total: 24.30

Fold 2:
  Size: 72
  States: {'Tas': 26, 'Vic': 24, 'NSW': 14, 'WA': 8}
  Mean Dry_Total: 46.19
  Std Dry_Total: 28.69

Fold 3:
  Size: 71
  States: {'Tas': 24, 'NSW': 23, 'Vic': 17, 'WA': 7}
  Mean Dry_Total: 46.63
  Std Dry_Total: 31.32

Fold 4:
  Size: 71
  States: {'Tas': 29, 'Vic': 28, 'NSW': 9, 'WA': 5}
  Mean Dry_Total: 44.71
  Std Dry_Total: 27.89

Fold 5:
  Size: 71
  States: {'Tas': 27, 'Vic': 26, 'NSW': 15, 'WA': 3}
  Mean Dry_Total: 45.46
  Std Dry_Total: 28.00
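
With plain `KFold`, the WA count per validation fold ranges from 3 to 9. One way to even this out is to stratify on State. The sketch below uses synthetic labels with the same state counts as the training set (the `states` array is a stand-in, not the real data) and scikit-learn's `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic State labels with the same counts as the training set above.
state_counts = {'Tas': 138, 'Vic': 112, 'NSW': 75, 'WA': 32}
states = np.repeat(list(state_counts.keys()), list(state_counts.values()))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
wa_per_fold = []
# StratifiedKFold only needs X for its length; the labels drive the split.
for fold, (train_idx, val_idx) in enumerate(skf.split(np.zeros(len(states)), states)):
    labels, counts = np.unique(states[val_idx], return_counts=True)
    fold_counts = dict(zip(labels.tolist(), counts.tolist()))
    wa_per_fold.append(fold_counts.get('WA', 0))
    print(f'Fold {fold + 1}: size={len(val_idx)}, states={fold_counts}')
```

On the real data the call would be `skf.split(train_pivot, train_pivot['State'])`. Whether this reduces the variance in fold scores is an open question, since mean Dry_Total is already similar across the unstratified folds.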


In [5]:
# Summary of key findings
print('='*60)
print('KEY FINDINGS FOR STRATEGY')
print('='*60)
print()
print('1. BIOMASS CONSTRAINTS HOLD ALMOST EXACTLY in training data')
print('   - GDM = Dry_Green + Dry_Clover (max deviation 0.0001)')
print('   - Dry_Total = GDM + Dry_Dead (mean deviation 0.0009, max 0.3088)')
print('   -> Post-processing to enforce these will keep predictions consistent')
print()
print('2. TABULAR FEATURES have moderate correlation with targets (r = 0.36 to 0.58)')
print('   - NDVI and Height are useful but not sufficient alone')
print('   -> Image features are critical')
print()
print('3. STATE MIX VARIES across folds')
print('   - WA ranges from 3 to 9 images per fold; mean Dry_Total is similar (43.6 to 46.6)')
print('   -> Consider stratified CV by State')
print()
print('4. TOP KERNELS USE:')
print('   - DINOv2-giant (1536 dims) vs our base (768 dims)')
print('   - Patch-based features (last_hidden_state[:,1:,:])')
print('   - Multiple model ensemble (GB, HGB, CatBoost, LightGBM)')
print('   - TTA with flips, rotations, blur')
print('   - Post-processing for biomass constraints')

KEY FINDINGS FOR STRATEGY

1. BIOMASS CONSTRAINTS HOLD ALMOST EXACTLY in training data
   - GDM = Dry_Green + Dry_Clover (max deviation 0.0001)
   - Dry_Total = GDM + Dry_Dead (mean deviation 0.0009, max 0.3088)
   -> Post-processing to enforce these will keep predictions consistent

2. TABULAR FEATURES have moderate correlation with targets (r = 0.36 to 0.58)
   - NDVI and Height are useful but not sufficient alone
   -> Image features are critical

3. STATE MIX VARIES across folds
   - WA ranges from 3 to 9 images per fold; mean Dry_Total is similar (43.6 to 46.6)
   -> Consider stratified CV by State

4. TOP KERNELS USE:
   - DINOv2-giant (1536 dims) vs our base (768 dims)
   - Patch-based features (last_hidden_state[:,1:,:])
   - Multiple model ensemble (GB, HGB, CatBoost, LightGBM)
   - TTA with flips, rotations, blur
   - Post-processing for biomass constraints
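
The patch-based feature item above refers to the tokens after index 0 in the transformer output. A minimal sketch of one way to turn them into a fixed-length vector is shown below, using a random NumPy array in place of a real model output. The function name `pool_tokens`, the mean pooling, and the concatenation with the CLS token are assumptions for illustration; the top kernels may pool differently.

```python
import numpy as np

def pool_tokens(last_hidden_state):
    """Combine the CLS token with mean-pooled patch tokens.

    last_hidden_state: array of shape (batch, 1 + num_patches, dim),
    with the CLS token at index 0 (assumed layout for this sketch).
    """
    cls_token = last_hidden_state[:, 0, :]       # (batch, dim)
    patch_tokens = last_hidden_state[:, 1:, :]   # (batch, num_patches, dim)
    patch_mean = patch_tokens.mean(axis=1)       # (batch, dim)
    return np.concatenate([cls_token, patch_mean], axis=1)

# Dummy output: 2 images, 1 CLS + 256 patch tokens, 768 dims (base-sized).
rng = np.random.default_rng(0)
dummy = rng.normal(size=(2, 1 + 256, 768))
features = pool_tokens(dummy)
print(features.shape)  # (2, 1536)
```

With a 768-dim backbone this yields 1536 columns per image, which then feed the gradient-boosting models in the same way as the CLS-only features.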
