# **Constructing imbalanced datasets for CIFAR10, CIFAR100, Cat_vs_Dog, STL10**

* Author: Zhuoning Yuan

**Useful Resources**:
* Website: https://libauc.org
* Github: https://github.com/Optimization-AI/LibAUC

**Reference**:  

If you find this tutorial helpful in your work, please acknowledge our library and cite the following papers:

<pre>
@inproceedings{yuan2021large,
  title={Large-scale robust deep auc maximization: A new surrogate loss and empirical studies on medical image classification},
  author={Yuan, Zhuoning and Yan, Yan and Sonka, Milan and Yang, Tianbao},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={3040--3049},
  year={2021}
}

@misc{libauc2022,
      title={LibAUC: A Deep Learning Library for X-Risk Optimization.},
      author={Zhuoning Yuan, Zi-Hao Qiu, Gang Li, Dixian Zhu, Zhishuai Guo, Quanqi Hu, Bokun Wang, Qi Qi, Yongjian Zhong, Tianbao Yang},
      year={2022}
    }
</pre>


# **01. Installing LibAUC**

In [1]:
!pip install libauc==1.1.9rc3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting libauc==1.1.9rc3
  Downloading libauc-1.1.9rc3-py3-none-any.whl (71 kB)
Installing collected packages: libauc
Successfully installed libauc-1.1.9rc3


# **02. Loading Datasets**



### CIFAR10
* **Description**: The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
* **Homepage:** https://www.cs.toronto.edu/~kriz/cifar.html



In [3]:
from libauc.datasets import CIFAR10
(train_data, train_label) = CIFAR10(root='./data', train=True) 
(test_data, test_label) = CIFAR10(root='./data', train=False) 

Files already downloaded and verified
Files already downloaded and verified


### CIFAR100
* **Description**: This dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses.
* **Homepage:** https://www.cs.toronto.edu/~kriz/cifar.html


In [4]:
from libauc.datasets import CIFAR100
(train_data, train_label) = CIFAR100(root='./data', train=True) 
(test_data, test_label) = CIFAR100(root='./data', train=False) 

Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to ./data/cifar-100-python.tar.gz


Extracting ./data/cifar-100-python.tar.gz to ./data
Files already downloaded and verified


### CAT_vs_DOG
* **Description**: The training archive contains 25,000 images of dogs and cats. Labels are encoded as 1 = dog and 0 = cat.
* **Homepage:** https://www.kaggle.com/c/dogs-vs-cats/data



In [5]:
from libauc.datasets import CAT_VS_DOG

(train_data, train_label) = CAT_VS_DOG('./data/', train=True)
(test_data, test_label) = CAT_VS_DOG('./data/', train=False)

Downloading https://homepage.divms.uiowa.edu/~zhuoning/datasets/cat_vs_dog.tar.gz to ./data/cat_vs_dog.tar.gz


Extracting ./data/cat_vs_dog.tar.gz to ./data/
Files already downloaded and verified



### STL10
* **Description**: The STL-10 dataset consists of 5000 96x96 colour training images in 10 classes, with 500 images per class. There are 8000 test images, with 800 images per class.
* **Homepage:** https://ai.stanford.edu/~acoates/stl10/



In [6]:
from libauc.datasets import STL10
(train_data, train_label) = STL10(root='./data/', split='train') # return numpy array
(test_data, test_label) = STL10(root='./data/', split='test') # return numpy array

Downloading http://ai.stanford.edu/~acoates/stl10/stl10_binary.tar.gz to ./data/stl10_binary.tar.gz


Extracting ./data/stl10_binary.tar.gz to ./data/
Files already downloaded and verified


# **03. Constructing Imbalanced Datasets**



Import the **`ImbalancedDataGenerator`** class, which converts an input dataset into a customized imbalanced binary dataset.

In [8]:
from libauc.utils import ImbalancedDataGenerator

Set the random seed to 123 (`SEED`) and the imbalance ratio to 0.1 (`imratio`):

In [7]:
SEED = 123
imratio = 0.1 # positive_samples / total_samples
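
Because `imratio` is the fraction of positives among all samples, the number of positives to keep for a fixed number of negatives follows from solving pos / (pos + neg) = imratio. The snippet below is a minimal sketch of that arithmetic; `positives_to_keep` is a hypothetical helper, not a LibAUC function, and rounding down is an assumption.

```python
def positives_to_keep(num_neg: int, imratio: float) -> int:
    """Hypothetical helper: positives needed so pos / (pos + neg) is about imratio."""
    if not 0.0 < imratio < 1.0:
        raise ValueError("imratio must lie strictly between 0 and 1")
    # pos / (pos + neg) = r  =>  pos = r / (1 - r) * neg (rounded down, by assumption)
    return int(imratio / (1.0 - imratio) * num_neg)


num_neg = 25000  # assumed: half of the 50000 CIFAR10 training images are negatives
num_pos = positives_to_keep(num_neg, imratio=0.1)
ratio = num_pos / (num_pos + num_neg)
print(num_pos, round(ratio, 4))
```

With 25000 negatives and a ratio of 0.1, this gives a positive count slightly below 25000 / 9, so the achieved ratio is marginally under 0.1.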

The new imbalanced training set consists of 2777 positive images and 25000 negative images. The test set keeps all 10000 images, with a balanced positive ratio of 0.5.

In [9]:
from libauc.datasets import CIFAR10
(train_data, train_label) = CIFAR10(root='./data', train=True) 
(test_data, test_label) = CIFAR10(root='./data', train=False) 

g = ImbalancedDataGenerator(verbose=True, random_seed=SEED)  # use the seed defined above
(train_images, train_labels) = g.transform(train_data, train_label, imratio=imratio)
(test_images, test_labels) = g.transform(test_data, test_label, imratio=0.5) 


Files already downloaded and verified
Files already downloaded and verified
#SAMPLES: [27777], POS:NEG: [2777 : 25000], POS RATIO: 0.1000
#SAMPLES: [10000], POS:NEG: [5000 : 5000], POS RATIO: 0.5000
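
The counts printed above are consistent with a procedure that binarizes the class labels and then discards positives until the target ratio is reached. The following NumPy sketch illustrates that idea on synthetic data; `make_imbalanced` is a hypothetical helper, and treating the upper half of the class ids as positive is an assumption made here, not a statement of how `ImbalancedDataGenerator` is implemented.

```python
import numpy as np


def make_imbalanced(data, labels, imratio, seed=123):
    """Hypothetical helper: binarize multi-class labels, then drop positives."""
    labels = np.asarray(labels).reshape(-1)
    num_classes = int(labels.max()) + 1
    # Assumption: lower half of class ids -> negative (0), upper half -> positive (1).
    binary = (labels >= num_classes // 2).astype(np.float32)

    # Shuffle so that the retained positives are a random subset.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(binary))
    data, binary = data[order], binary[order]

    neg_idx = np.flatnonzero(binary == 0)
    pos_idx = np.flatnonzero(binary == 1)
    # Keep all negatives and only as many positives as the target ratio allows.
    keep_pos = min(len(pos_idx), int(imratio / (1.0 - imratio) * len(neg_idx)))
    keep = np.sort(np.concatenate([neg_idx, pos_idx[:keep_pos]]))
    return data[keep], binary[keep]


# Synthetic stand-in for an image array: 10 classes, 100 samples each.
fake_labels = np.repeat(np.arange(10), 100)
fake_data = np.zeros((1000, 32, 32, 3), dtype=np.uint8)
x, y = make_imbalanced(fake_data, fake_labels, imratio=0.1)
print(x.shape, int(y.sum()), len(y))
```

With `imratio=0.5` and equally sized classes, this sketch keeps every sample, which matches the unchanged size of the test set above.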


# **04. Preparing datasets for training with DataLoaders**

In [10]:
import torch
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
import numpy as np
from PIL import Image

class ImageDataset(Dataset):
    def __init__(self, images, targets, image_size=32, crop_size=30, mode='train'):
        self.images = images.astype(np.uint8)
        self.targets = targets
        self.mode = mode
        # Training: random crop and horizontal flip, then resize back to image_size
        self.transform_train = transforms.Compose([
            transforms.ToTensor(),
            transforms.RandomCrop((crop_size, crop_size), padding=None),
            transforms.RandomHorizontalFlip(),
            transforms.Resize((image_size, image_size)),
        ])
        # Testing: no augmentation, only resize
        self.transform_test = transforms.Compose([
            transforms.ToTensor(),
            transforms.Resize((image_size, image_size)),
        ])

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        target = self.targets[idx]
        image = Image.fromarray(image)  # already uint8, converted in __init__
        if self.mode == 'train':
            image = self.transform_train(image)
        else:
            image = self.transform_test(image)
        return image, target
  

trainloader = DataLoader(ImageDataset(train_images, train_labels, mode='train'), batch_size=128, shuffle=True, num_workers=2, pin_memory=True)
testloader = DataLoader(ImageDataset(test_images, test_labels, mode='test'), batch_size=128, shuffle=False, num_workers=2,  pin_memory=True)

Now, we are ready to train models using the new imbalanced dataset. 