# Debug: CUDA Out of Memory Fix

## Problem
The training crashes with `torch.OutOfMemoryError` because:
1. **Variable-width images**: Text line images have widths up to 2475+ pixels
2. **Batch padding**: `AlignCollate` pads ALL images in a batch to the maximum width
3. **Memory explosion**: batch_size=48 × 128 height × 2475 width ≈ 15.2 million input values per batch, and every activation map in the network scales with that padded width
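
To put numbers on the padding cost, the size of the padded input batch can be computed directly. This is a rough sketch: it assumes single-channel float32 inputs and counts only the input tensor, not the activations, gradients, or optimizer state that usually dominate GPU memory. The helper name `padded_batch_elements` is hypothetical.

```python
def padded_batch_elements(batch_size, height, max_width, channels=1):
    """Number of elements in a batch after every image is padded to max_width."""
    return batch_size * channels * height * max_width


BYTES_PER_FLOAT32 = 4

# Worst case from the crash: 48 images, 128 px high, padded to 2475 px.
worst = padded_batch_elements(48, 128, 2475)
# Same batch with the width capped at 1600 px.
capped = padded_batch_elements(48, 128, 1600)

print(f"uncapped: {worst:,} elements, {worst * BYTES_PER_FLOAT32 / 2**20:.1f} MiB")
print(f"capped:   {capped:,} elements, {capped * BYTES_PER_FLOAT32 / 2**20:.1f} MiB")
print(f"capped / uncapped = {capped / worst:.2f}")
```

Feature maps inside the network scale with the same width factor, so a width cap should reduce activation memory roughly in proportion.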

## Solutions

### Solution 1: Limit Maximum Image Width (Recommended)
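Truncate very long images to a maximum width (e.g., 1600 pixels). Text beyond the cap is cut off while the label stays unchanged, so the longest lines are trained on partial images, but the padded batch width is bounded and the OOM is prevented.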
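
One way to pick the cap is to check how many images a given value would truncate. The sketch below is illustrative only: the helper names and the width list are hypothetical, and in practice the widths would be collected from the actual dataset.

```python
import math


def fraction_truncated(widths, max_width):
    """Share of images whose width exceeds max_width (these would lose pixels)."""
    if not widths:
        return 0.0
    return sum(1 for w in widths if w > max_width) / len(widths)


def smallest_cap_keeping(widths, keep_fraction):
    """Smallest width cap that leaves at least keep_fraction of images untouched."""
    if not widths:
        raise ValueError("widths must not be empty")
    ordered = sorted(widths)
    # Index of the last image that must still fit under the cap.
    idx = max(0, math.ceil(keep_fraction * len(ordered)) - 1)
    return ordered[idx]


# Hypothetical widths in pixels; replace with widths measured on the dataset.
widths = [400, 650, 900, 1200, 1500, 1580, 1600, 1900, 2200, 2475]

print(fraction_truncated(widths, 1600))
print(smallest_cap_keeping(widths, 0.5))
```

If only a small share of lines exceeds the cap, truncation costs little; if the share is large, Solution 2 or 3 may be a better fit.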

### Solution 2: Dynamic Batch Sizing
Reduce batch size when images are very wide.
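
A minimal sketch of one way to do this: sort samples by width and close a batch as soon as its padded size would exceed a pixel budget. The function name, the budget, and the sample widths are assumptions for illustration; none of them come from `utils/dataset.py`.

```python
def batches_by_width_budget(widths, height, max_pixels, max_batch_size):
    """Group sample indices so that len(batch) * height * widest <= max_pixels.

    Samples are sorted by width first, so each batch holds similar widths and
    little memory is wasted on padding. An image that exceeds the budget on
    its own still gets a batch of size 1.
    """
    order = sorted(range(len(widths)), key=lambda i: widths[i])
    batches, current = [], []
    for i in order:
        # Widths are ascending, so the candidate is the widest image in the batch.
        padded_pixels = (len(current) + 1) * height * widths[i]
        if current and (padded_pixels > max_pixels or len(current) == max_batch_size):
            batches.append(current)
            current = []
        current.append(i)
    if current:
        batches.append(current)
    return batches


# Hypothetical widths; the budget allows two full-width 128 px high images.
widths = [300, 2400, 800, 2475, 350, 900]
batches = batches_by_width_budget(widths, height=128, max_pixels=128 * 2500 * 2,
                                  max_batch_size=4)
print(batches)
```

Lists of indices like these can be passed to `torch.utils.data.DataLoader` through its `batch_sampler` argument. Shuffling the order of the batches each epoch restores some randomness that sorting by width removes.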

### Solution 3: Gradient Accumulation
Use smaller actual batch size but accumulate gradients to simulate larger batches.
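
Accumulation can stand in for a larger batch because, for a mean-reduced loss, summing the gradients of equally sized micro-batches (each scaled by 1/N) reproduces the full-batch gradient. The framework-free sketch below checks this on a one-parameter least-squares model; the data and names are illustrative and not taken from `main.py`. In PyTorch the same pattern is typically written as `(loss / accum_steps).backward()` on every micro-batch, with `optimizer.step()` and `optimizer.zero_grad()` called once every `accum_steps` micro-batches.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# Gradient from one large batch of 8 samples.
full_grad = mse_grad(w, xs, ys)

# Same gradient from 4 micro-batches of 2 samples each.
micro_batch_size = 2
accum_steps = len(xs) // micro_batch_size
accumulated = 0.0
for step in range(accum_steps):
    lo = step * micro_batch_size
    hi = lo + micro_batch_size
    # Scale each micro-batch gradient by 1/accum_steps before adding it.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps

print(full_grad, accumulated)
```

The match is exact only for layers without batch-dependent statistics; batch normalization, for example, still sees the smaller micro-batch.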

## Fix 1: Update `utils/dataset.py` - Add Maximum Width Limit

This is the **primary fix**. We modify `AlignCollate` to cap the maximum width and add width limiting in `pil_loader`.

In [None]:
# Changes to make in utils/dataset.py

# 1. Update AlignCollate class to accept max_width parameter:
"""
class AlignCollate(object):
    def __init__(self, imgH=48, PAD='ZerosPAD', max_width=1600):
        self.imgH = imgH
        self.PAD = PAD
        self.max_width = max_width  # Maximum width to prevent OOM

    def __call__(self, batch):
        batch = [x for x in batch if x is not None]
        images, labels = zip(*batch)

        maxW = 0
        for image in images:
            h, w, c = image.shape
            if w > maxW:
                maxW = w
        
        # Cap maximum width to prevent OOM errors
        maxW = min(maxW, self.max_width)

        if self.PAD == 'ZerosPAD':
            trans = ZerosPAD((1, self.imgH, maxW))
        elif self.PAD == 'NormalizePAD':
            trans = NormalizePAD((1, self.imgH, maxW))
        else:
            raise ValueError("not expected padding.")

        padded_images = []
        for image in images:
            h, w, c = image.shape
            # Truncate image if wider than the capped batch width
            if w > maxW:
                print(f"Warning: truncating image width {w} -> {maxW}")
                image = image[:, :maxW, :]
            padded_images.append(trans(image))

        image_tensors = torch.cat([t.unsqueeze(0) for t in padded_images], 0)

        return image_tensors, labels
"""
print("AlignCollate fix ready - adds max_width parameter")

## Fix 2: Update `main.py` - Use New API and Lower Batch Size

Three issues to fix:
1. **Deprecated TF32 API**: PyTorch 2.9+ warns about old `allow_tf32` settings
2. **Deprecated AMP API**: `torch.cuda.amp.GradScaler` and `autocast` are deprecated
3. **Batch size**: Reduce from 48 to 32 for safety with variable-width images

In [None]:
# Changes to make in main.py

# 1. Update setup_a100_optimizations() to use new TF32 API:
"""
def setup_a100_optimizations():
    if torch.cuda.is_available():
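        try:
            # New TF32 API (PyTorch 2.9+); verify against your installed version
            torch.backends.cuda.matmul.fp32_precision = "tf32"
            torch.backends.cudnn.conv.fp32_precision = "tf32"
        except AttributeError:
            # Older PyTorch: fall back to the legacy flags
            torch.backends.cuda.matmul.allow_tf32 = True
            torch.backends.cudnn.allow_tf32 = True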
        
        # Enable cuDNN benchmarking
        torch.backends.cudnn.benchmark = True
        ...
"""

# 2. Update GradScaler to new API (line ~247):
# OLD: scaler = GradScaler()
# NEW: scaler = torch.amp.GradScaler('cuda')

# 3. Update autocast to new API (line ~382):
# OLD: with autocast():
# NEW: with torch.amp.autocast('cuda'):

# 4. Update imports at top:
# OLD: from torch.cuda.amp import autocast, GradScaler
# NEW: (remove this line, use torch.amp directly)

print("main.py fixes ready")

## Recommended Batch Sizes with max_width=1600

| GPU | VRAM | Recommended Batch Size |
|-----|------|----------------------|
| A100 40GB | 40 GB | 32-40 |
| A100 80GB | 80 GB | 48-64 |
| V100 | 16-32 GB | 16-24 |
| T4 | 16 GB | 8-12 |

With `max_width=1600`, batch_size=32 should be safe on A100 40GB.

## Summary of Changes Applied

The following files have been modified to fix the OOM error:

### 1. `utils/dataset.py`
- Added `max_width` parameter to `AlignCollate` class (default: 1600)
- Images wider than `max_width` are truncated
- Warning printed when truncation occurs

### 2. `main.py`
- Updated to use the device-agnostic `torch.amp` API (`torch.amp.autocast('cuda')` and `torch.amp.GradScaler('cuda')`) in place of the deprecated `torch.cuda.amp` entry points
- Added `max_width=1600` to all `AlignCollate` calls
- Fixed deprecated TF32 warnings

### 3. `colab_train.ipynb`
- Reduced A100 batch size from 48 to 32 (safer with variable-width images)
- Reduced V100 batch size from 24 to 20
- Added documentation about max_width enforcement

## To Re-run Training

Pull the updated code and rerun the training command. With the width cap and the smaller batch size, the OOM should no longer occur:

In [None]:
# Run this command in Colab after pulling the updated code:
# !git pull origin main

# Then run training:
# !python main.py -m hctr -d data/hwdb2.0 -b 32 -ep 50 -pf 100 -vf 5000 -j 4

print("Training command ready. Use batch_size=32 for A100 40GB.")