# **Optimizing Partial AUC Loss on an Imbalanced Dataset**

**Author**: Zhuoning Yuan

**Introduction**

In this tutorial, you will learn how to quickly train a ResNet18 model by optimizing the **Partial AUC (pAUC)** score using our novel optimization methods on a binary image classification task on CIFAR10. This is a **wrapper** around the original implementations of the partial AUC losses. For the original tutorials, please refer to [SOPA](https://github.com/Optimization-AI/LibAUC/blob/main/examples/11_Optimizing_pAUC_Loss_with_SOPA_on_Imbalanced_data.ipynb), [SOPA-s](https://github.com/Optimization-AI/LibAUC/blob/main/examples/11_Optimizing_pAUC_Loss_with_SOPAs_on_Imbalanced_data.ipynb), and [SOTA-s](https://github.com/Optimization-AI/LibAUC/blob/main/examples/11_Optimizing_pAUC_Loss_with_SOTAs_on_Imbalanced_data.ipynb). After completing this tutorial, you should be able to use LibAUC to train your own models on your own datasets.

**Useful Resources**

* Website: https://libauc.org
* Github: https://github.com/Optimization-AI/LibAUC


**References**

If you find this tutorial helpful in your work, please acknowledge our library and cite the following papers:

<pre>
@article{zhu2022auc,
  title={When AUC meets DRO: Optimizing Partial AUC for Deep Learning with Non-Convex Convergence Guarantee},
  author={Zhu, Dixian and Li, Gang and Wang, Bokun and Wu, Xiaodong and Yang, Tianbao},
  journal={arXiv preprint arXiv:2203.00176},
  year={2022}
}

@misc{libauc2022,
	title={LibAUC: A Deep Learning Library for X-risk Optimization.},
	author={Zhuoning Yuan, Zi-Hao Qiu, Gang Li, Dixian Zhu, Zhishuai Guo, Quanqi Hu, Bokun Wang, Qi Qi, Yongjian Zhong, Tianbao Yang},
	year={2022}
	}
</pre>

## **Installing LibAUC**

Let's start by installing our library. In this tutorial, we will use the beta version `1.1.9rc4`.

In [None]:
!pip install libauc==1.1.9rc4

## **Importing LibAUC**

Import the required libraries.




In [None]:
from libauc.models import resnet18
from libauc.datasets import CIFAR10
from libauc.losses.auc import pAUCLoss  # default: SOPA
from libauc.optimizers import SOPA
from libauc.utils import ImbalancedDataGenerator
from libauc.sampler import DualSampler
from libauc.metrics import pauc_roc_score

import torch
from PIL import Image
import numpy as np
import torchvision.transforms as transforms
from torch.utils.data import Dataset

import warnings
warnings.filterwarnings("ignore")

## **Reproducibility**

These functions limit the sources of random behavior, such as model initialization and data shuffling. However, completely reproducible results are not guaranteed across PyTorch releases [[Ref]](https://pytorch.org/docs/stable/notes/randomness.html#:~:text=Completely%20reproducible%20results%20are%20not,even%20when%20using%20identical%20seeds.).

In [None]:
def set_all_seeds(SEED):
    # REPRODUCIBILITY
    torch.manual_seed(SEED)
    np.random.seed(SEED)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
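
Note that `set_all_seeds` seeds PyTorch and NumPy only; Python's built-in `random` module is not touched. As a small illustration of what seeding buys you, the sketch below uses the standard-library `random` module as a stand-in for any shuffling step. The helper name `seeded_shuffle` is made up for this example and is not part of LibAUC.

```python
import random

def seeded_shuffle(items, seed):
    """Shuffle a copy of `items` with a private, seeded generator (hypothetical helper)."""
    rng = random.Random(seed)   # independent generator, does not affect global state
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled

first = seeded_shuffle(range(10), seed=123)
second = seeded_shuffle(range(10), seed=123)

# Same seed, same order; the elements themselves are unchanged.
print(first == second)                    # True
print(sorted(first) == list(range(10)))   # True
```

If your own pipeline relies on Python's `random`, you may want to seed it as well when aiming for repeatable runs.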

## **Image Dataset**


Now we define the data input pipeline, including data augmentation. In this tutorial, we use `RandomCrop` and `RandomHorizontalFlip`. The `pos_index_map` maps each global dataset index to a local index among the positive samples, which reduces memory cost in the loss function since we only need to track the indices of positive samples. Please refer to the original paper [here](https://arxiv.org/pdf/2203.00176.pdf) for more details.
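
To make the role of `pos_index_map` concrete, here is a minimal plain-Python sketch of the global-to-local mapping. The toy `targets` list and the helper name `local_pos_index` are made up for illustration; only the mapping logic follows the dataset class in this section.

```python
# Toy labels: 1 marks a positive sample, 0 a negative one (illustrative data).
targets = [0, 1, 0, 0, 1, 1]

# Global indices of the positive samples, in dataset order.
pos_indices = [i for i, t in enumerate(targets) if t == 1]

# Map each positive's global index to its rank among the positives.
pos_index_map = {global_idx: local_idx
                 for local_idx, global_idx in enumerate(pos_indices)}

def local_pos_index(global_idx):
    """Return the local positive index, or -1 for a negative (hypothetical helper)."""
    return pos_index_map.get(global_idx, -1)

print(pos_index_map)        # {1: 0, 4: 1, 5: 2}
print(local_pos_index(4))   # 1
print(local_pos_index(2))   # -1
```

With this layout, any per-positive state only needs one slot per positive sample (3 here) rather than one per sample (6), and local indices start at 0.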




In [None]:
class ImageDataset(Dataset):
    def __init__(self, images, targets, image_size=32, crop_size=30, mode='train'):
       self.images = images.astype(np.uint8)
       self.targets = targets
       self.mode = mode
       self.transform_train = transforms.Compose([                                                
                              transforms.ToTensor(),
                              transforms.RandomCrop((crop_size, crop_size), padding=None),
                              transforms.RandomHorizontalFlip(),
                              transforms.Resize((image_size, image_size)),
                              ])
       self.transform_test = transforms.Compose([
                             transforms.ToTensor(),
                             transforms.Resize((image_size, image_size)),
                              ])
       
       # for loss function
       self.pos_indices = np.flatnonzero(targets==1)
       self.pos_index_map = {}
       for i, idx in enumerate(self.pos_indices):
           self.pos_index_map[idx] = i

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        target = self.targets[idx]
        image = Image.fromarray(image.astype('uint8'))
        if self.mode == 'train':
            idx = self.pos_index_map.get(idx, -1)  # local positive index, or -1 for negatives
            image = self.transform_train(image)
        else:
            image = self.transform_test(image)
        return image, target, int(idx)

## **Hyper-parameters**

In [None]:
# general params
lr = 1e-3
weight_decay = 5e-4
total_epoch = 60
decay_epochs = [20, 40]
batch_size = 64

# By default, we use one-way partial AUC
alpha = 0.  # a: min_tpr=0. This is a fixed value (for reference only)
beta = 0.1  # b: max_fpr=0.1

# By default, pAUCLoss calls SOPA in the backend
margin = 1.0
eta = 1e1 # learning rate for controlling the weights of negative samples

# sampling parameters
sampling_rate = 0.5
num_pos = int(batch_size*sampling_rate)
num_neg = int(batch_size*(1-sampling_rate))
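
To illustrate what the sampling parameters above mean for a batch, here is a minimal plain-Python sketch that builds batches with a fixed number of positives and negatives. It is not the `DualSampler` implementation; the function name `fixed_ratio_batches` and the sampling strategy (fresh positives drawn for every batch, negatives visited once) are assumptions made for this example.

```python
import random

def fixed_ratio_batches(labels, batch_size, sampling_rate, seed=0):
    """Yield index batches with int(batch_size * sampling_rate) positives each (illustrative)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    num_pos = int(batch_size * sampling_rate)
    num_neg = batch_size - num_pos
    rng.shuffle(neg)
    batches = []
    # Walk through the negatives once; positives are re-drawn for every batch.
    for start in range(0, len(neg) - num_neg + 1, num_neg):
        batch = rng.sample(pos, num_pos) + neg[start:start + num_neg]
        batches.append(batch)
    return batches

toy_labels = [1] * 10 + [0] * 40          # 20% positives, as with imratio = 0.2
toy_batches = fixed_ratio_batches(toy_labels, batch_size=8, sampling_rate=0.5)
print(len(toy_batches))                               # 10
print(sum(toy_labels[i] for i in toy_batches[0]))     # 4
```

Even though only 20% of the toy data is positive, every batch contains half positives, which is the effect `sampling_rate = 0.5` is meant to have.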

## **Loading datasets**

In this step, we use [CIFAR10](https://www.cs.toronto.edu/~kriz/cifar.html) as the benchmark dataset. Before passing the data to the `dataloader`, we construct an imbalanced version of CIFAR10 with `ImbalancedDataGenerator`. Specifically, it first splits the training data by class ID (e.g., 10 classes) into two even portions as the positive and negative classes, and then randomly removes some samples from the positive class to make it imbalanced. We keep the testing set untouched. We use `imratio` to denote the ratio of the number of positive examples to the total number of examples.
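
As a worked example of how `imratio` determines the dataset size, the plain-Python sketch below drops positives until the requested ratio is reached. It is a simplified stand-in, not the `ImbalancedDataGenerator` code; the function name `subsample_positives` is hypothetical.

```python
import random

def subsample_positives(labels, imratio, seed=0):
    """Keep all negatives and drop positives until positives / total == imratio."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Solve n_pos / (n_pos + n_neg) = imratio for n_pos.
    n_pos = round(imratio * len(neg) / (1.0 - imratio))
    kept_pos = rng.sample(pos, min(n_pos, len(pos)))
    return sorted(kept_pos + neg)

# CIFAR10 training set after the even binary split: 25000 positives, 25000 negatives.
labels = [1] * 25000 + [0] * 25000
kept = subsample_positives(labels, imratio=0.2)
num_pos = sum(labels[i] for i in kept)
print(len(kept), num_pos, num_pos / len(kept))  # 31250 6250 0.2
```

These counts agree with the generator's log for `imratio = 0.2`: 6250 positives and 25000 negatives out of 31250 samples.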

In [None]:
train_data, train_targets = CIFAR10(root='./data', train=True)
test_data, test_targets  = CIFAR10(root='./data', train=False)

imratio = 0.2
generator = ImbalancedDataGenerator(shuffle=True, verbose=True, random_seed=0)
(train_images, train_labels) = generator.transform(train_data, train_targets, imratio=imratio)
(test_images, test_labels) = generator.transform(test_data, test_targets, imratio=0.5) 

trainDataset = ImageDataset(train_images, train_labels)
testDataset = ImageDataset(test_images, test_labels, mode='test')

Files already downloaded and verified
Files already downloaded and verified
#SAMPLES: [31250], POS:NEG: [6250 : 25000], POS RATIO: 0.2000
#SAMPLES: [10000], POS:NEG: [5000 : 5000], POS RATIO: 0.5000


Now, we can import data and load the dataset!

In [None]:
trainSet = ImageDataset(train_images, train_labels)
testSet = ImageDataset(test_images, test_labels, mode='test')

sampler = DualSampler(trainSet, batch_size, sampling_rate=sampling_rate)
trainloader = torch.utils.data.DataLoader(trainSet, batch_size=batch_size,  sampler=sampler,  shuffle=False,  num_workers=1)
testloader = torch.utils.data.DataLoader(testSet , batch_size=batch_size, shuffle=False, num_workers=1)

## **Model and Loss setup**
For `pAUCLoss`, you can set the `backend` argument to `SOPA`, `SOPAs`, or `SOTAs`. Here is a brief summary of each loss. For more details regarding the parameters, please refer to the original tutorials.


- **SOPA** [[Ref](https://github.com/Optimization-AI/LibAUC/blob/main/examples/11_Optimizing_pAUC_Loss_with_SOPA_on_Imbalanced_data.ipynb)]
```
loss_fn = pAUC_CVaR_Loss(pos_length=sampler.pos_len, num_neg=num_neg, beta=beta)
optimizer = SOPA(model, loss_fn=loss_fn, mode='adam', lr=lr, eta=eta, weight_decay=weight_decay)
```

- **SOPAs** [[Ref](https://github.com/Optimization-AI/LibAUC/blob/main/examples/11_Optimizing_pAUC_Loss_with_SOPAs_on_Imbalanced_data.ipynb)]
```
loss_fn = pAUC_DRO_Loss(pos_len=sampler.pos_len, margin=margin, beta=beta, Lambda=Lambda)
optimizer = SOPAs(model, loss_fn=loss_fn, mode='adam', lr=lr, weight_decay=weight_decay)
```

- **SOTAs** [[Ref](https://github.com/Optimization-AI/LibAUC/blob/main/examples/11_Optimizing_pAUC_Loss_with_SOTAs_on_Imbalanced_data.ipynb)]
```
loss_fn = tpAUC_KL_Loss(pos_len=sampler.pos_len, Lambda=Lambda, tau=tau)
optimizer = SOTAs(model, loss_fn=loss_fn, mode='adam', lr=lr, gammas=(gamma0, gamma1), weight_decay=weight_decay) 
```
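
For intuition about the CVaR-style loss behind SOPA, the sketch below averages, for each positive, only the largest pairwise surrogate losses against the negatives. It is a simplified, non-differentiable illustration in plain Python, not the `pAUC_CVaR_Loss` implementation; the squared-hinge surrogate, the top-`k` rule with `k = int(beta * num_neg)`, and the function name `cvar_pauc_loss` are assumptions made for this example.

```python
def cvar_pauc_loss(pos_scores, neg_scores, beta, margin=1.0):
    """Mean over positives of the average of the k largest pairwise losses (illustrative)."""
    k = max(1, int(len(neg_scores) * beta))   # number of hardest negatives kept
    total = 0.0
    for p in pos_scores:
        # Squared-hinge surrogate for each (positive, negative) pair.
        pair_losses = [max(0.0, margin - (p - n)) ** 2 for n in neg_scores]
        # Keep only the k largest losses, i.e. the negatives ranked closest to or above p.
        hardest = sorted(pair_losses, reverse=True)[:k]
        total += sum(hardest) / k
    return total / len(pos_scores)

# One positive scored 1.0, one hard negative (1.0) and one easy negative (0.0).
print(cvar_pauc_loss([1.0], [1.0, 0.0], beta=0.5))  # 1.0  (only the hard negative counts)
print(cvar_pauc_loss([1.0], [1.0, 0.0], beta=1.0))  # 0.5  (both negatives averaged)
```

A smaller `beta` concentrates the loss on the highest-scoring negatives, which correspond to the low-FPR region of the ROC curve.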

In [None]:
seed = 123
set_all_seeds(seed)
model = resnet18(pretrained=False, num_classes=1, last_activation=None) 
model = model.cuda()

loss_fn = pAUCLoss(pos_len=sampler.pos_len, backend='SOPA', beta=beta, num_neg=num_neg, margin=margin)
optimizer = SOPA(model.parameters(), loss_fn=loss_fn.loss_fn, mode='adam', lr=lr, eta=eta, weight_decay=weight_decay)

Backend loss: SOPA


## **Training**

Now we start training the model. We evaluate one-way partial AUC with no lower bound on the True Positive Rate (`min_tpr=0`) and the False Positive Rate (FPR) restricted to at most 0.3 (`max_fpr=0.3`). Note that this evaluation range is wider than the `beta = 0.1` used for the training loss above. This can be done with `libauc.metrics.pauc_roc_score(y_true, y_pred, max_fpr=0.3, min_tpr=0)`.
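
One way to read one-way partial AUC is as a pairwise ranking statistic restricted to the highest-scoring negatives. The plain-Python sketch below follows that reading; it is an illustration under that assumption, not the `pauc_roc_score` implementation, which may normalize differently, so the values need not match. The function name `one_way_pauc` is hypothetical.

```python
import math

def one_way_pauc(y_true, y_score, max_fpr):
    """Fraction of (positive, hard-negative) pairs ranked correctly (illustrative)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = sorted((s for t, s in zip(y_true, y_score) if t == 0), reverse=True)
    # Only the top max_fpr fraction of negatives can be false positives at FPR <= max_fpr.
    k = max(1, math.floor(len(neg) * max_fpr))
    hardest_neg = neg[:k]
    # Count correctly ordered pairs; ties count as half.
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in hardest_neg)
    return correct / (len(pos) * k)

y_true = [1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1]
print(one_way_pauc(y_true, y_score, max_fpr=0.5))  # 0.75
print(one_way_pauc(y_true, y_score, max_fpr=1.0))  # 0.875 (ordinary AUC)
```

Restricting to the hardest negatives lowers the score here because the positive scored 0.4 is outranked by the negative scored 0.8, and that mistake weighs more once easy negatives are excluded.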

In [None]:
print ('Start Training')
print ('-'*30)
test_best = 0
train_list, test_list = [], []
for epoch in range(total_epoch):
    
    if epoch in decay_epochs:
        optimizer.update_lr(decay_factor=10, coef_decay_factor=10)
            
    train_pred, train_true = [], []
    model.train() 
    for idx, (data, targets, index) in enumerate(trainloader):
        data, targets  = data.cuda(), targets.cuda()
        y_pred = model(data)
        y_prob = torch.sigmoid(y_pred)
        loss = loss_fn(y_prob, targets, index_p=index) # make sure: index >= 0 for positive samples, and index < 0 (here -1) for negative samples
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        train_pred.append(y_prob.cpu().detach().numpy())
        train_true.append(targets.cpu().detach().numpy())

    train_true = np.concatenate(train_true)
    train_pred = np.concatenate(train_pred)
    train_pauc = pauc_roc_score(train_true, train_pred, max_fpr=0.3)
    train_list.append(train_pauc)
    
    # evaluation
    model.eval()
    test_pred, test_true = [], []
    with torch.no_grad():  # gradients are not needed during evaluation
        for j, data in enumerate(testloader):
            test_data, test_targets, index = data
            test_data = test_data.cuda()
            y_pred = model(test_data)
            y_prob = torch.sigmoid(y_pred)
            test_pred.append(y_prob.cpu().numpy())
            test_true.append(test_targets.numpy())
    test_true = np.concatenate(test_true)
    test_pred = np.concatenate(test_pred)
    val_pauc = pauc_roc_score(test_true, test_pred, max_fpr=0.3)
    test_list.append(val_pauc)
    
    if test_best < val_pauc:
        test_best = val_pauc
    
    model.train()
    print("epoch: %s, lr: %.4f, train_pauc: %.4f, test_pauc: %.4f, test_best: %.4f"%(epoch, optimizer.lr, train_pauc, val_pauc, test_best))
    

Start Training
------------------------------
epoch: 0, lr: 0.0010, train_pauc: 0.6013, test_pauc: 0.6739, test_best: 0.6739
epoch: 1, lr: 0.0010, train_pauc: 0.7130, test_pauc: 0.7297, test_best: 0.7297
epoch: 2, lr: 0.0010, train_pauc: 0.7643, test_pauc: 0.7690, test_best: 0.7690
epoch: 3, lr: 0.0010, train_pauc: 0.8009, test_pauc: 0.7667, test_best: 0.7690
epoch: 4, lr: 0.0010, train_pauc: 0.8294, test_pauc: 0.7836, test_best: 0.7836
epoch: 5, lr: 0.0010, train_pauc: 0.8523, test_pauc: 0.8172, test_best: 0.8172
epoch: 6, lr: 0.0010, train_pauc: 0.8672, test_pauc: 0.8204, test_best: 0.8204
epoch: 7, lr: 0.0010, train_pauc: 0.8866, test_pauc: 0.8356, test_best: 0.8356
epoch: 8, lr: 0.0010, train_pauc: 0.9022, test_pauc: 0.8321, test_best: 0.8356
epoch: 9, lr: 0.0010, train_pauc: 0.9130, test_pauc: 0.8010, test_best: 0.8356
epoch: 10, lr: 0.0010, train_pauc: 0.9225, test_pauc: 0.8252, test_best: 0.8356
epoch: 11, lr: 0.0010, train_pauc: 0.9324, test_pauc: 0.8377, test_best: 0.8377
epoc