-
Hi! I'm working on a class-incremental learning problem with a replay buffer. The scenario is SplitCIFAR10 with 5 experiences, and the model has an IncrementalClassifier as its final layer. I use ClassBalancedBuffer with adaptive size in order to keep the buffer balanced among the classes in it. I'm trying to build a data loader that yields class-balanced batches, such that each batch contains about the same number of exemplars from each class (whether they come from the buffer or from the current experience). I thought GroupBalancedDataLoader might fit here, but I can't make it work correctly, and the documentation is not very clear. What should I pass to "datasets" in order to make it class-balanced? I see it should be a sequence of datasets, but which datasets?
EDIT: I think I've managed to make GroupBalancedDataLoader class-balanced: I passed as "datasets" a list with a separate ClassificationDataset for each class.
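For reference, the EDIT solution looks roughly like this. This is only a minimal sketch: "combined" is a placeholder for the current experience's data concatenated with the buffer, and classification_subset plus the loader keyword arguments are based on recent Avalanche versions, so check them against your install.

```python
import torch
from avalanche.benchmarks.utils import classification_subset
from avalanche.benchmarks.utils.data_loader import GroupBalancedDataLoader

def split_by_class(dataset):
    # Build one ClassificationDataset per class present in `dataset`.
    targets = torch.as_tensor(list(dataset.targets))
    return [
        classification_subset(
            dataset, indices=torch.nonzero(targets == c).flatten().tolist()
        )
        for c in torch.unique(targets)
    ]

per_class_sets = split_by_class(combined)  # `combined`: placeholder dataset
loader = GroupBalancedDataLoader(
    per_class_sets,
    batch_size=len(per_class_sets) * 16,  # total batch size, split evenly across groups
    shuffle=True,
)
```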
Since I passed shuffle=True, I expected the classes to also be shuffled within each batch. How can I make that happen? Thank you
-
Hey Shahariel,
Essentially, in cross-validation there's a stratified version where we use a smaller subset of a dataset to train on and then test on the rest (this eliminates, to an extent, a bias in choosing validation data). While it's not perfectly balanced, each fold mimics the class distribution of the original dataset. How to do this in Avalanche isn't immediately clear, but as you mentioned, split_mnist.train_stream[0].dataset[image_index] contains a tuple with the image tensor, the target, and the task label it belongs to. My method is essentially the following:
Code-wise, my solution looks like this; experiment with it until it fits your needs:
Save the images locally [note: here I'm using a 2-experience SplitMNIST], creating 20 balanced training subsets of the first train stream (for fine-tuning purposes); you can save the second train set and the two test sets by adjusting the "enumerate" loop.
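A rough reconstruction of that step; this is a sketch only, assuming a 2-experience SplitMNIST and torchvision's save_image, with illustrative directory names:

```python
import os
import random

from avalanche.benchmarks.classic import SplitMNIST
from torchvision.utils import save_image

benchmark = SplitMNIST(n_experiences=2)
train_data = benchmark.train_stream[0].dataset

# Group sample indices by class; each item is (image, target, task label).
by_class = {}
for idx, (_, target, _) in enumerate(train_data):
    by_class.setdefault(int(target), []).append(idx)

n_subsets = 20
per_class = min(len(v) for v in by_class.values()) // n_subsets

# Shuffle within each class so the 20 subsets are random but disjoint.
for indices in by_class.values():
    random.shuffle(indices)

for subset_id in range(n_subsets):
    for cls, indices in by_class.items():
        out_dir = f"mnist_subsets/subset_{subset_id:02d}/{cls}"
        os.makedirs(out_dir, exist_ok=True)
        for i in indices[subset_id * per_class:(subset_id + 1) * per_class]:
            img, _, _ = train_data[i]
            save_image(img, os.path.join(out_dir, f"{i}.png"))
```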
Then, to create the custom dataset following PyTorch's API, I made sure that the two test streams stay separate; but in your case, if you want to test on everything in one go, you should save them in the same folder and use a single common ImageFolder object.
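Something along these lines; the paths follow the sketch above and are assumptions, and the Grayscale transform is there because ImageFolder loads the saved PNGs back as RGB:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

tfm = transforms.Compose([transforms.Grayscale(), transforms.ToTensor()])

train_set = datasets.ImageFolder("mnist_subsets/subset_00", transform=tfm)
test_set_1 = datasets.ImageFolder("mnist_test/experience_0", transform=tfm)  # kept separate
test_set_2 = datasets.ImageFolder("mnist_test/experience_1", transform=tfm)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
```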
From there you can essentially use a classic train/validate loop, like:
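For example, with plain PyTorch (model, optimizer, and criterion are whatever you already use):

```python
import torch

def train_one_epoch(model, loader, optimizer, criterion, device="cpu"):
    model.train()
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        optimizer.step()

@torch.no_grad()
def evaluate(model, loader, device="cpu"):
    model.eval()
    correct = total = 0
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == targets).sum().item()
        total += targets.numel()
    return correct / total
```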
This is probably not the best way to do things, but until we hear back from the maintainers, this is what I cooked up.
Thank you for your answer!
I actually think my solution (under "EDIT" in my question) is OK. It's not a problem that the batches are organized by class.