This version continues on V2 to generate an improved image classification model for Paddy disease. The script trains an ensemble of larger models with larger inputs. However, using larger models comes with a larger memory expense (primarily during gradient calculations). Therefore, various tactics are introduced to tackle them. 

In order to test quickly on a few different models, image sizes and discover the successful ones, a small subset of data is collected for running short epochs -- the memory use will still be the same, but it'll be much faster.

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [2]:
try: import fastkaggle
except ModuleNotFoundError:
    !pip install -Uq fastkaggle

from fastkaggle import *

In [3]:
from fastai.imports import *
np.set_printoptions(linewidth=130)
import os
from pathlib import Path
path = Path('/kaggle/input/paddy-doctor/paddy-disease-classification')
path

Path('/kaggle/input/paddy-doctor/paddy-disease-classification')

In [4]:
from fastai.vision.all import *
set_seed(42)

path.ls()

(#6) [Path('/kaggle/input/paddy-doctor/paddy-disease-classification/sample_submission.csv'),Path('/kaggle/input/paddy-doctor/paddy-disease-classification/train_images'),Path('/kaggle/input/paddy-doctor/paddy-disease-classification/.jovianrc'),Path('/kaggle/input/paddy-doctor/paddy-disease-classification/.ipynb_checkpoints'),Path('/kaggle/input/paddy-doctor/paddy-disease-classification/train.csv'),Path('/kaggle/input/paddy-doctor/paddy-disease-classification/test_images')]

In [5]:
tst_files = get_image_files(path/'test_images').sorted()

In [6]:
# Finding no of data points for different classes

df = pd.read_csv(path/'train.csv')
df.label.value_counts()

label
normal                      1764
blast                       1738
hispa                       1594
dead_heart                  1442
tungro                      1088
brown_spot                   965
downy_mildew                 620
bacterial_leaf_blight        479
bacterial_leaf_streak        380
bacterial_panicle_blight     337
Name: count, dtype: int64

In [7]:
# bacterial_panicle_blight is selected for training (tests), since it's the smallest

trn_path = path/'train_images'/'bacterial_panicle_blight'

## Setting up train function

The train function is based on the same idea as per V2, with a few additional tweaks. A *finetune* argument is introduced to run either fine_tune() method, or the fit_one_cycle() method; the latter is faster since it doesn't do an initial fine-tuning of the head. The finetune function also performs TTA predictions on the test set, since later on the TTA results of a number of models will be ensembled. 

The train function no longer has seed=42 in the ImageDataLoaders line, leading to different training and validation sets each time it's called. It is appropriate for ensembling, since it means that each model will use slightly different data.

Further, 'gradient accumulation' is introduced using the *accum* argument, which primarily does two things: 1) divides the batch size by accum and 2) adds the GradientAccumulation callback, passing in accum

Gradient accumulation refers to a very simple trick: rather than updating the model weights after every batch based on that batch's gradients, instead keep accumulating (adding up) the gradients for a few batches, and then update the model weights with those accumulated gradients. 

In fastai, the parameter passed to GradientAccumulation defines how many batches of gradients are accumulated. Since here the gradients are adding up over 'accum' batches, the batch size needs to be divided by that same number. The resulting training loop is nearly mathematically identical to using the original batch size, but the amount of memory used is the same as using a batch size accum times smaller!

For instance, here's a basic example of a single epoch of a training loop without gradient accumulation:

    for x,y in dl:
        calc_loss(coeffs, x, y).backward()
        coeffs.data.sub_(coeffs.grad * lr)
        coeffs.grad.zero_()
        
Here's the same thing, but with gradient accumulation added (assuming a target effective batch size of 64):

    count = 0            # track count of items seen since last weight update
    for x,y in dl:
        count += len(x)  # update count based on this minibatch size
        calc_loss(coeffs, x, y).backward()
        if count>=64:     # count is greater than accumulation target, so do weight update
            coeffs.data.sub_(coeffs.grad * lr)
            coeffs.grad.zero_()
            count=0      # reset count
            
----------------------

The original script was as follows:

    def train(arch, size, item=Resize(480, method='squish'), accum=1, finetune=True, epochs=12):
        dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=item,
            batch_tfms=aug_transforms(size=size, min_scale=0.75), bs=64//accum)
            
        cbs = GradientAccumulation(64) if accum else []
        
        learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
        
        if finetune:
            learn.fine_tune(epochs, 0.01)
            return learn.tta(dl=dls.test_dl(tst_files))
        else:
            learn.unfreeze()
            learn.fit_one_cycle(epochs, 0.01)


However, this has issues with the swin models. vision_learner is a FastAI wrapper around timm models and when arch='swin_large_patch4_window7_224' is passed, FastAI tries to build a model from timm.

This gives an error "RuntimeError: running_mean should contain 14 elements not 3072". 

It's because FastAI is trying to apply a BatchNorm head to Swin’s outputs, and the dimensions don’t match. Swin’s final feature dimension is 1536 or 3072, depending on the variant, while FastAI’s default vision_learner assumes something closer to ResNet-like outputs. That mismatch makes BatchNorm blow up.

Therefore, the following train function is used, which hardcodes the timm architecture. It controls the head (num_classes) explicitly, and Learner is created manually with Learner(...)

The fine_tune function has explicit control: 1 epoch with frozen body at 1e-3, then unfreeze and train for the remaining epochs at 1e-4.

In  the earlier (original) function: fine_tune(epochs, lr), Fastai does freeze → unfreeze automatically, with its LR finder/discriminative LR system. Easier, but less transparent — and limited to models that vision_learner supports.

In [8]:
import timm

def train(arch, size, item=Resize(480, method='squish'), accum=1, finetune=True, epochs=12):
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=item,
        batch_tfms=aug_transforms(size=size, min_scale=0.75),bs=64//accum)

    cbs = GradientAccumulation(64) if accum else []

    # Build the model directly in timm with the correct number of classes
    model = timm.create_model(arch, pretrained=True, num_classes=dls.c)

    # Standard fastai Learner on that model (no fastai auto-head)
    learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=error_rate, cbs=cbs).to_fp16()

    if finetune:
        # mimic fine_tune: train head a bit, then unfreeze
        learn.freeze()
        learn.fit_one_cycle(1, 1e-3)
        learn.unfreeze()
        learn.fit_one_cycle(epochs-1 if epochs > 1 else 1, 1e-4)
        return learn.tta(dl=dls.test_dl(tst_files))
    else:
        learn.fit_one_cycle(epochs, 1e-3)

    timm.create_model(arch, pretrained=True, num_classes=dls.c)

'arch': Defines the architecture, e.g.,convnext_small_in22k as used below

'pretrained=True': Loads the model with weights pre-trained on a large dataset

'num_classes=dls.c': Sets the number of output classes to match the number of classes in the dataset (dls.c is the number of classes in the DataLoaders)

It leverages a state-of-the-art model architecture with pre-trained weights, useful for transfer learning. The model is adapted to the task at hand by setting the correct number of output classes.

    learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=error_rate, cbs=cbs).to_fp16()

'loss_func=CrossEntropyLossFlat()': Specifies the loss function. CrossEntropyLossFlat is standard for multi-class classification

'metrics=error_rate': Tracks the error rate during training

'cbs=cbs': Adds any callbacks, such as GradientAccumulation

'.to_fp16()': Enables mixed-precision training, which speeds up training and reduces memory usage by using 16-bit floating-point numbers where possible

In [9]:
# Small model to check the impact of gradient accumulation:

train('convnext_small_in22k', 128, epochs=1, accum=1, finetune=False)

  model = create_fn(


model.safetensors:   0%|          | 0.00/265M [00:00<?, ?B/s]

  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()
  self.autocast,self.learn.scaler,self.scales = autocast(dtype=dtype),GradScaler(**self.kwargs),L()


epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:10


This function finds out how much memory is used, and also to then clears out the memory for the next run:

In [10]:
import gc
def report_gpu():
    print(torch.cuda.list_gpu_processes())
    gc.collect()
    torch.cuda.empty_cache()

In [11]:
report_gpu()

GPU:0
process       2688 uses     3364.000 MB GPU memory


roughly 3.5 GB of memory with accum=1

In [12]:
train('convnext_small_in22k', 128, epochs=1, accum=2, finetune=False)
report_gpu()

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:11


GPU:0
process       2688 uses     2322.000 MB GPU memory


roughly 2.5 GB of memory with accum=1

In [13]:
train('convnext_small_in22k', 128, epochs=1, accum=4, finetune=False)
report_gpu()

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:19


GPU:0
process       2688 uses     1804.000 MB GPU memory


roughly 2 GB of memory with accum=4

## Checking memory use

The memory use for each of the following architectures and sizes is checked, to ensure they all fit in 16GB RAM (Kaggle limit). These will be used for training later. accum=1 could give a large memory error, so accum=2 is used.

convnext_large:

In [14]:
train('convnext_large_in22k', 224, epochs=1, accum=2, finetune=False)
report_gpu()

  model = create_fn(


model.safetensors:   0%|          | 0.00/919M [00:00<?, ?B/s]

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:32


GPU:0
process       2688 uses    10228.000 MB GPU memory


In [15]:
train('convnext_large_in22k', (320,240), epochs=1, accum=2, finetune=False)
report_gpu()

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:39


GPU:0
process       2688 uses    13538.000 MB GPU memory


Convnext with two different image sizes uses around 10-14 GB

vit and swin models: Here, the _xyz is the size of the image used in the org format of these models, and it's a good practice to use that size only when using for training purposes

In [16]:
train('vit_large_patch16_224', 224, epochs=1, accum=2, finetune=False)
report_gpu()

model.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:17


GPU:0
process       2688 uses    11762.000 MB GPU memory


In [17]:
train('swinv2_large_window12_192_22k', 192, epochs=1, accum=2, finetune=False)
report_gpu()

  model = create_fn(


model.safetensors:   0%|          | 0.00/917M [00:00<?, ?B/s]

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:12


GPU:0
process       2688 uses    12714.000 MB GPU memory


In [18]:
train('swin_large_patch4_window7_224', 224, epochs=1, accum=2, finetune=False)
report_gpu()

model.safetensors:   0%|          | 0.00/788M [00:00<?, ?B/s]

epoch,train_loss,valid_loss,error_rate,time
0,0.0,0.0,0.0,00:12


GPU:0
process       2688 uses    11122.000 MB GPU memory


Since these models use memory less than 16 GB, they can be used for complete training

## Running the models

The train function defined above works perfectly, but it takes a lot of time in the epochs, e.g., each epoch takes around 20 mins

To reduce this, a new train function (without making any substantial changes) is defined. However, this still takes about 8-10 mins. Thus, to go over all listed models will use a considerable amount of GPU time (30-hour limit in Kaggle). Therefore, only 1 epoch is used to show the complete procedure.

In [19]:
import timm
import torch
from fastai.vision.all import *

# Enable cudnn auto-tuner
torch.backends.cudnn.benchmark = True

def train(arch, size, item=Resize(480, method='squish'), accum=1, finetune=True, epochs=12, bs=64):
    # DataLoaders
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=item,batch_tfms=aug_transforms(size=size, min_scale=0.75),
        bs=bs//accum,
        num_workers=4,  # increase if Kaggle allows
        pin_memory=True
    )

    # Gradient accumulation callback
    cbs = [GradientAccumulation(accum)] if accum > 1 else []

    # Create timm model
    model = timm.create_model(arch, pretrained=True, num_classes=dls.c)

    # Learner
    learn = Learner(
        dls, model,
        loss_func=CrossEntropyLossFlat(),
        metrics=error_rate,
        cbs=cbs
    ).to_fp16()

    if finetune:
        # Mimic fastai fine_tune
        learn.freeze()
        learn.fit_one_cycle(1, 1e-3)
        learn.unfreeze()
        learn.fit_one_cycle(max(epochs-1, 1), 1e-4)
        return learn.tta(dl=dls.test_dl(tst_files))
    else:
        learn.fit_one_cycle(epochs, 1e-3)
        return learn  # return for ensemble use


In [20]:
res = 640,480

In [21]:
# cleaning up RAM before initiating 

report_gpu()

GPU:0
process       2688 uses      388.000 MB GPU memory


In [22]:
models = {
    'convnext_large_in22k': {
        (Resize(res), (320,224)),
    }, 'vit_large_patch16_224': {
        (Resize(480, method='squish'), 224),
        (Resize(res), 224),
    }, 'swinv2_large_window12_192_22k': {
        (Resize(480, method='squish'), 192),
        (Resize(res), 192),
    }, 'swin_large_patch4_window7_224': {
        (Resize(res), 224),
    }
}

In [23]:
trn_path = path/'train_images'

In [24]:
tta_res = []

for arch, details in models.items():
    for item, size in details:
        print(f'--- {arch} | size {size} | item {item.name}')
        learn = train(arch, size, item=item, accum=2, epochs=1)
        tta_res.append(learn)
        gc.collect()
        torch.cuda.empty_cache()

--- convnext_large_in22k | size (320, 224) | item Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}


epoch,train_loss,valid_loss,error_rate,time
0,0.477811,0.311465,0.097069,11:42


epoch,train_loss,valid_loss,error_rate,time
0,0.182102,0.172453,0.051418,11:28


--- vit_large_patch16_224 | size 224 | item Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}


epoch,train_loss,valid_loss,error_rate,time
0,2.123993,2.053512,0.728015,08:49


epoch,train_loss,valid_loss,error_rate,time
0,1.705292,1.626145,0.573763,08:48


--- vit_large_patch16_224 | size 224 | item Resize -- {'size': (480, 480), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}


epoch,train_loss,valid_loss,error_rate,time
0,2.294318,2.181,0.836617,08:47


epoch,train_loss,valid_loss,error_rate,time
0,2.082934,2.047758,0.729938,08:47


--- swinv2_large_window12_192_22k | size 192 | item Resize -- {'size': (480, 480), 'method': 'squish', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}


epoch,train_loss,valid_loss,error_rate,time
0,2.17308,2.159558,0.827006,05:39


epoch,train_loss,valid_loss,error_rate,time
0,2.055514,2.011435,0.715041,05:39


--- swinv2_large_window12_192_22k | size 192 | item Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}


epoch,train_loss,valid_loss,error_rate,time
0,1.869004,1.718959,0.621336,05:40


epoch,train_loss,valid_loss,error_rate,time
0,1.112241,0.954458,0.326766,05:40


--- swin_large_patch4_window7_224 | size 224 | item Resize -- {'size': (480, 640), 'method': 'crop', 'pad_mode': 'reflection', 'resamples': (<Resampling.BILINEAR: 2>, <Resampling.NEAREST: 0>), 'p': 1.0}


epoch,train_loss,valid_loss,error_rate,time
0,2.154639,2.154676,0.824123,05:55


epoch,train_loss,valid_loss,error_rate,time
0,2.073656,1.993422,0.727535,05:55


## Ensembling

Saving the result

In [25]:
save_pickle('tta_res.pkl', tta_res)

Learner.tta returns predictions and targets for each row, only predictions are needed

In [26]:
tta_prs = first(zip(*tta_res))

In [27]:
tta_prs += tta_prs[1:3]

An ensemble simply refers to a model which is itself the result of combining a number of other models. The simplest way to do ensembling is to take the average of the predictions of each model:

In [28]:
avg_pr = torch.stack(tta_prs).mean(0)
avg_pr.shape

torch.Size([3469, 10])

In [29]:
dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=Resize(480, method='squish'),
    batch_tfms=aug_transforms(size=224, min_scale=0.75))

In [30]:
idxs = avg_pr.argmax(dim=1)
vocab = np.array(dls.vocab)
ss = pd.read_csv(path/'sample_submission.csv')
ss['label'] = vocab[idxs]
ss.to_csv('subm.csv', index=False)

In [32]:
!head subm.csv

image_id,label
200001.jpg,hispa
200002.jpg,normal
200003.jpg,normal
200004.jpg,blast
200005.jpg,blast
200006.jpg,brown_spot
200007.jpg,dead_heart
200008.jpg,hispa
200009.jpg,hispa
