# Evolver Loop 2 Analysis

## Objective
Analyze the winning solutions and create a strategic plan to close the 13.95-point gap between our best CV accuracy (77.37%) and the 1st-place private LB score (91.32%).

## Key Findings from Winning Solutions

### 1st Place (Golddiggaz) - 91.32% private LB
- **Ensemble of 4 models**: ResNeXt50, ViT-B/16, EfficientNet-B4 (NoisyStudent), CropNet (MobileNetV3)
- **Key insight**: CropNet from TF Hub (pretrained on cassava) added crucial diversity
- **Heavy augmentation**: RandomResizedCrop, Transpose, flips, ShiftScaleRotate, color transforms, CoarseDropout, Cutout
- **Advanced loss**: Bi-Tempered Logistic Loss, label smoothing
- **TTA**: Overlapping patches with multiple augmentations
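
To make the label smoothing item concrete, here is a minimal sketch in plain Python. It is not the winners' code; the function name `smooth_labels` and the smoothing factor `eps=0.1` are assumptions chosen for illustration, with five classes as in this dataset.

```python
def smooth_labels(label, num_classes=5, eps=0.1):
    """Return a smoothed target vector for one sample.

    Standard label smoothing: (1 - eps) * one_hot + eps / num_classes.
    `eps` is a hypothetical value, not taken from the winning solution.
    """
    uniform = eps / num_classes
    return [
        (1.0 - eps) + uniform if c == label else uniform
        for c in range(num_classes)
    ]


# Example: smoothed target for a CMD (class 3) sample
target = smooth_labels(3)
```

The true class keeps most of the probability mass while every other class receives a small share, which discourages over-confident predictions on noisy labels.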

### 2nd Place (Devon Stanfield) - 90.25% public LB
- **Simple approach**: Just CropNet from TF Hub
- **Shows power of domain-specific pretrained model**

## Current State
- **Best CV**: 77.37% (exp_000, baseline CNN)
- **Target**: 91.32%
- **Gap**: 13.95 points
- **Class imbalance**: CMD (class 3) = 61.5%, CBB (class 0) = 5.1%

## Strategic Priorities

### 1. Transfer Learning (Immediate - 8-12 point gain expected)
- Current: Custom CNN trained from scratch
- Target: Pretrained models, starting with CropNet (domain-specific) + EfficientNet-B4

### 2. Heavy Augmentation (2-3 point gain)
- Current: basic flips/rotations
- Target: RandAugment, CutMix, MixUp, color transforms
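
Of the target augmentations, MixUp is the simplest to sketch. The version below is a toy illustration on flat Python lists rather than image tensors; the function name `mixup`, the `alpha=0.4` default, and the `rng` parameter are assumptions, not settings from any of the solutions above.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.4, rng=random):
    """Blend two samples and their one-hot labels with a Beta-distributed weight.

    x1, x2: flat lists of pixel values (toy stand-ins for images)
    y1, y2: one-hot label lists
    alpha:  Beta(alpha, alpha) shape parameter (hypothetical default)
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Example: mix a class-0 sample with a class-1 sample
mixed_x, mixed_y, lam = mixup([0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                              rng=random.Random(0))
```

The mixed label is a soft target, so the loss function must accept probability vectors rather than integer class indices.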

### 3. Class Imbalance Handling (1-2 point gain)
- Current: None
- Target: Class weights, focal loss, oversampling
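
One way to combine two of these targets is a class-weighted focal loss. The sketch below works on a single sample in plain Python and assumes the inverse-frequency weights reported under Key Insights in this notebook; the function name `focal_loss` and `gamma=2.0` are assumptions for illustration only.

```python
import math

# Inverse-frequency class weights for classes 0-4, as reported in this notebook
CLASS_WEIGHTS = [3.94, 1.96, 1.79, 0.33, 1.66]


def focal_loss(probs, label, gamma=2.0, weights=CLASS_WEIGHTS):
    """Class-weighted focal loss for one sample.

    probs:  predicted class probabilities (should sum to 1)
    label:  integer index of the true class
    gamma:  focusing parameter (hypothetical default); gamma=0 gives
            weighted cross-entropy
    """
    p_t = max(probs[label], 1e-12)  # clamp to avoid log(0)
    return -weights[label] * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction contributes far less than a poor one
easy = focal_loss([0.9, 0.025, 0.025, 0.025, 0.025], label=0)
hard = focal_loss([0.1, 0.225, 0.225, 0.225, 0.225], label=0)
```

The `(1 - p_t) ** gamma` factor down-weights easy examples, while the per-class weight compensates for how rarely a class appears.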

### 4. Ensembling (2-3 point gain)
- Current: Single model
- Target: Multiple diverse architectures
- CropNet + EfficientNet + ResNeXt/ViT
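
A common way to combine diverse models is to average their predicted class probabilities. The sketch below shows that idea in plain Python; it is not the 1st-place blending scheme, and the function name `ensemble_predict`, the optional `weights` argument, and the example probabilities are all made up for illustration.

```python
def ensemble_predict(model_probs, weights=None):
    """Average per-model class probabilities and return (avg_probs, predicted_class).

    model_probs: list of probability vectors, one per model, for a single image
    weights:     optional per-model weights (hypothetical); defaults to equal weights
    """
    n_models = len(model_probs)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    num_classes = len(model_probs[0])
    avg = [
        sum(w * p[c] for w, p in zip(weights, model_probs)) / total
        for c in range(num_classes)
    ]
    predicted = max(range(num_classes), key=avg.__getitem__)
    return avg, predicted


# Example with three hypothetical models disagreeing on one image
avg_probs, predicted = ensemble_predict([
    [0.1, 0.1, 0.1, 0.6, 0.1],
    [0.3, 0.1, 0.1, 0.4, 0.1],
    [0.5, 0.1, 0.1, 0.2, 0.1],
])
```

Averaging probabilities rather than taking a majority vote keeps each model's confidence in play, which is where architectural diversity tends to help.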

### 5. Advanced Training (1-2 point gain)
- Current: Fixed LR, 10 epochs
- Target: Cosine annealing, warmup, 30-50 epochs
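
The target schedule can be sketched as a function of the epoch index: linear warmup followed by cosine annealing. All hyperparameter values below (`total_epochs=30`, `warmup_epochs=3`, `base_lr`, `min_lr`) and the function name `lr_at_epoch` are assumptions for illustration, not tuned settings.

```python
import math


def lr_at_epoch(epoch, total_epochs=30, warmup_epochs=3, base_lr=1e-3, min_lr=1e-6):
    """Learning rate for a 0-indexed epoch: linear warmup, then cosine decay."""
    if epoch < warmup_epochs:
        # Ramp linearly up to base_lr over the warmup epochs
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the post-warmup phase completed, in [0, 1)
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Full schedule for a 30-epoch run
schedule = [lr_at_epoch(e) for e in range(30)]
```

In a real training loop the same shape is usually obtained from the framework's built-in schedulers; this standalone version is only meant to show the curve.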

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load training data
train_df = pd.read_csv('/home/code/data/train.csv')

# Class distribution analysis
class_counts = train_df['label'].value_counts().sort_index()
class_names = ['CBB', 'CBSD', 'CGM', 'CMD', 'Healthy']

print("Class Distribution:")
# Iterate by actual label value so names stay aligned even if a class is missing
for label, count in class_counts.items():
    percentage = count / len(train_df) * 100
    print(f"Class {label} ({class_names[label]}): {count:,} samples ({percentage:.1f}%)")

# Calculate imbalance ratio
imbalance_ratio = class_counts.max() / class_counts.min()
print(f"\nImbalance ratio (max/min): {imbalance_ratio:.1f}x")

# Class weights for loss function
total_samples = len(train_df)
class_weights = total_samples / (len(class_counts) * class_counts)
print(f"\nClass weights: {class_weights.round(2).tolist()}")

# Visualize
plt.figure(figsize=(10, 6))
sns.barplot(x=class_names, y=class_counts.values)
plt.title('Class Distribution - Cassava Leaf Disease Dataset')
plt.ylabel('Number of Samples')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## Key Insights

1. **Severe imbalance**: CMD (class 3) has 12.1x as many samples as CBB (class 0)
2. **Class weights needed**: [3.94, 1.96, 1.79, 0.33, 1.66] for classes 0-4
3. **Transfer learning is critical**: Winners used pretrained models rather than training from scratch
4. **CropNet is special**: Domain-specific pretrained model on cassava diseases
5. **Ensembling essential**: Single models plateau around 89-90%, ensembles reach 91.3%+
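
As a rough consistency check on items 1 and 2, the inverse-frequency weight used in the analysis cell can be rewritten in terms of class shares. The figures below use only the rounded shares quoted under Current State (61.5% and 5.1%), so the last digit can differ slightly from the weights computed from exact counts.

$$
w_c = \frac{N}{K \, n_c} = \frac{1}{K \, p_c}, \qquad p_c = \frac{n_c}{N}, \qquad K = 5
$$

$$
w_{\text{CMD}} \approx \frac{1}{5 \times 0.615} \approx 0.33, \qquad
w_{\text{CBB}} \approx \frac{1}{5 \times 0.051} \approx 3.9, \qquad
\frac{n_{\text{CMD}}}{n_{\text{CBB}}} \approx \frac{0.615}{0.051} \approx 12.1
$$

Here $N$ is the total number of training samples, $n_c$ the count for class $c$, and $K$ the number of classes.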

In [None]:
# Verify data integrity
import os
from PIL import Image

data_dir = '/home/code/data'
train_dir = os.path.join(data_dir, 'train_images')

# Check image files
image_files = os.listdir(train_dir)
print(f"Total image files: {len(image_files)}")
print(f"CSV entries: {len(train_df)}")
print(f"Counts match: {len(image_files) == len(train_df)}")  # compares counts only, not filenames

# Sample image dimensions (sorted, since os.listdir order is arbitrary)
sample_images = sorted(image_files)[:10]
dimensions = []
for img_file in sample_images:
    img_path = os.path.join(train_dir, img_file)
    with Image.open(img_path) as img:
        dimensions.append(img.size)

print(f"\nSample image dimensions: {dimensions}")
print(f"Average dimension: {np.mean([d[0] for d in dimensions]):.0f}x{np.mean([d[1] for d in dimensions]):.0f}")