# Loading custom CelebA dataset for CC-VAE

We explain how the `data_loader` works and how to adapt the `setup_data_loaders` function to preprocess other datasets for training with `ss_vae.py`.

**NB:** To start the training, make sure the folder *./data/output* exists (otherwise an error is thrown).
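A minimal way to create that folder up front, assuming the relative path from the note above. The helper name `ensure_output_dir` is ours and is not part of the repository:

```python
import os


def ensure_output_dir(path: str = "./data/output") -> str:
    """Create the output folder if it is missing and return its path."""
    # exist_ok=True makes the call a no-op when the folder already exists
    os.makedirs(path, exist_ok=True)
    return path
```

Calling `ensure_output_dir()` once before launching the training is enough; calling it again is harmless.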

In [2]:
import torch
from typing import Any, Callable, List, Optional, Tuple, Union
import csv
from collections import namedtuple
import os
import math
from utils import dataset_cached as dc
from utils import multiclass_dataset_cached as mdc

## 1 - Understanding CelebA loading process

In the `root` argument, we specify the path to our dataset. For our work, the datasets are in **./data/datasets/**, where you can find *CelebA, Mnist or Chexput*.

The `setup_data_loaders` function returns a dictionary of the form `{mode(string): DataLoader}`, where each `DataLoader` wraps a `Dataset`.

It takes many arguments, but only the `root` interests us, as all the others are reusable for any dataset (batch_size, ...).
**Each dataset we take as input will be first transformed into a `Dataset` object before setting the data loader.**

The `Dataset` type we use inherits from the `CelebA` class, which itself inherits from the `VisionDataset` class.
We have the following inheritance graph:
- **CelebaCached <- CelebA <- VisionDataset <- data.Dataset**, where `CelebA` and `VisionDataset` come with the *torchvision* package and `data.Dataset` comes from `torch.utils.data`.
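To check such a chain on any object, one can walk the class's method resolution order. The sketch uses empty stand-in classes with the same names (they are placeholders, not the real torch or torchvision classes) so that it runs without the dataset:

```python
# Stand-in classes mirroring the inheritance graph (placeholders only)
class Dataset: ...
class VisionDataset(Dataset): ...
class CelebA(VisionDataset): ...
class CelebaCached(CelebA): ...


def lineage(cls):
    """Return the class names along the MRO, leaving out `object`."""
    return [c.__name__ for c in cls.__mro__ if c is not object]


chain = lineage(CelebaCached)
print(" <- ".join(chain))  # CelebaCached <- CelebA <- VisionDataset <- Dataset
```

The same helper can be applied to a real loader, for example `lineage(type(loaders['sup'].dataset))`.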


### Load CelebA data

The easiest way to work on subparts of CelebA is to load a Dataset of the right shape and type with the provided `CelebaCached` class, and then randomly discard data until it has the size we want.

**How to make it work:** The authors set up an integrity check to avoid loading corrupted data.
If you want to work with data you downloaded yourself (available on the CelebA dataset website), you will have to:
- Download the code from our GitHub (we modified the `CelebACached` class)
- In the torchvision package, in the `celeba.py` file, make the following two changes:
  1. Add `check_itr = True` as the last argument of the `__init__` method (**line 67**)
  2. Wrap the `if self._check_integrity():` block in an `if check_itr:` test (**line 82**)

That way, we bypass the mandatory integrity test and can load our dataset correctly.
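The effect of the extra argument can be sketched with a toy class. `ToyDataset` and its `files_ok` flag are hypothetical stand-ins for illustration; only the name `check_itr` comes from the modification described here:

```python
class ToyDataset:
    """Stand-in for a dataset class with an optional integrity check."""

    def __init__(self, files_ok: bool, check_itr: bool = True) -> None:
        self.files_ok = files_ok
        # The integrity test only runs when the caller asks for it
        if check_itr and not self._check_integrity():
            raise RuntimeError("Dataset not found or corrupted.")

    def _check_integrity(self) -> bool:
        # Placeholder for the real check on the files on disk
        return self.files_ok


# With check_itr=False the object is built even if the check would fail
ds = ToyDataset(files_ok=False, check_itr=False)
```

Keep in mind that skipping the check also means that genuinely incomplete or corrupted files are no longer detected at load time.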

In [4]:
loaders = dc.setup_data_loaders(False, 1, root="./data/datasets/celeba", download=False)

Loading original binary classes dataset
Splitting Dataset
Loading original binary classes dataset
Loading original binary classes dataset


In [5]:
loaders

{'sup': <torch.utils.data.dataloader.DataLoader at 0x7fd2073fd580>,
 'test': <torch.utils.data.dataloader.DataLoader at 0x7fd1fe8548e0>,
 'valid': <torch.utils.data.dataloader.DataLoader at 0x7fd1e81cc8b0>}

### Visualizing Shape of Loaders

In [6]:
for l in loaders:
    dataloader = loaders[l]
    print(f'\nDatastet Name {l}')
    print(f'Characteristics {dataloader.dataset}')

    print(f'Elements of dataset are {type(dataloader.dataset[0])} of shape {len(dataloader.dataset[0])}:')
    print(f'\tElt 1 is a {type(dataloader.dataset[0][0])} of shape {(dataloader.dataset[0][0]).shape}')
    print(f'\tElt 2 is a {type(dataloader.dataset[0][1])} of shape {(dataloader.dataset[0][1]).shape}')



Datastet Name sup
Characteristics Dataset CELEBACached
    Number of datapoints: 142770
    Root location: ./data/datasets/celeba
    Target type: ['attr']
    Split: train
Elements of dataset are <class 'tuple'> of shape 2:
	Elt 1 is a <class 'torch.Tensor'> of shape torch.Size([3, 64, 64])
	Elt 2 is a <class 'torch.Tensor'> of shape torch.Size([18])

Datastet Name test
Characteristics Dataset CELEBACached
    Number of datapoints: 19962
    Root location: ./data/datasets/celeba
    Target type: ['attr']
    Split: test
Elements of dataset are <class 'tuple'> of shape 2:
	Elt 1 is a <class 'torch.Tensor'> of shape torch.Size([3, 64, 64])
	Elt 2 is a <class 'torch.Tensor'> of shape torch.Size([18])

Datastet Name valid
Characteristics Dataset CELEBACached
    Number of datapoints: 20000
    Root location: ./data/datasets/celeba
    Target type: ['attr']
    Split: train
Elements of dataset are <class 'tuple'> of shape 2:
	Elt 1 is a <class 'torch.Tensor'> of shape torch.Size([3, 64, 6

### Pruning Datasets to train on less data

Again, be sure to download the code from the GitHub repository, as we added the `random_prune_dataloader` function to the `dataset_cached.py` file.

In [7]:
# Let's now prune our dataloaders to get fewer samples

sup_dl = loaders['sup']
pruning_ratio= 0.7 # We want to keep 70% of the original dataset

pruned_sup_dl = dc.random_prune_dataloader(sup_dl, prune_ratio=pruning_ratio)

In [8]:
print(f'Original dataset of shape {len(sup_dl.dataset)}')
print(f'Pruned dataset of shape {len(pruned_sup_dl.dataset)}')

Original dataset of shape 142770
Pruned dataset of shape 99939
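We do not reproduce `random_prune_dataloader` here. As a rough sketch of the idea, random pruning amounts to keeping a uniformly drawn subset of indices; the helper name `random_keep_indices` and the `seed` parameter are assumptions made for this illustration:

```python
import random
from typing import List, Optional


def random_keep_indices(n_samples: int, keep_ratio: float,
                        seed: Optional[int] = None) -> List[int]:
    """Pick round(keep_ratio * n_samples) distinct indices uniformly at random."""
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    n_keep = round(keep_ratio * n_samples)
    rng = random.Random(seed)  # local generator, so results are reproducible
    return sorted(rng.sample(range(n_samples), n_keep))


kept = random_keep_indices(142770, 0.7, seed=0)
print(len(kept))  # 99939
```

With PyTorch, such an index list could then be wrapped with `torch.utils.data.Subset(dataset, kept)` before building a new `DataLoader`; whether the repository does exactly this is not shown in this notebook.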


In [9]:
CelebaCached_dl = pruned_sup_dl.dataset.dataset

CelebaCached_dl.sub_label_inds # indexes of celeba labels that are in easy labels

[1, 3, 5, 8, 9, 11, 12, 13, 15, 18, 20, 24, 26, 28, 31, 33, 38, 39]
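These 18 indices link the 40 CelebA attributes to the 18-dimensional label tensors seen earlier. A plain-Python sketch of that selection, assuming the labels are obtained by indexing the full attribute row (the helper `select_columns` is ours):

```python
# Index list copied from the cell output
sub_label_inds = [1, 3, 5, 8, 9, 11, 12, 13, 15, 18, 20, 24, 26, 28, 31, 33, 38, 39]


def select_columns(row, inds):
    """Keep only the entries of `row` at the positions listed in `inds`."""
    return [row[i] for i in inds]


full_row = list(range(40))  # toy row: each value equals its column index
sub_row = select_columns(full_row, sub_label_inds)
print(len(sub_row))  # 18
```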

In [10]:
len(CelebaCached_dl.attr_names) # all the attributes are present

40

In [11]:
CelebaCached_dl.train_labels_sup

tensor([[0, 1, 1,  ..., 0, 0, 1],
        [0, 0, 0,  ..., 0, 0, 1],
        [0, 0, 0,  ..., 0, 0, 1],
        ...,
        [0, 1, 1,  ..., 0, 0, 1],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 1,  ..., 0, 0, 1]])

In [12]:
CelebaCached_dl.prior

tensor([0.2656, 0.2037, 0.1518, 0.2385, 0.1495, 0.2036, 0.1437, 0.0574, 0.0645,
        0.3843, 0.4193, 0.8343, 0.0430, 0.0799, 0.4791, 0.3200, 0.0730, 0.7786])
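For binary labels, each entry of this prior is the fraction of datapoints for which the label equals 1, i.e. the column mean of the label matrix. A dependency-free sketch on a made-up 4 x 3 label matrix (the helper name `binary_prior` is ours):

```python
def binary_prior(labels):
    """Column-wise mean of a 0/1 label matrix given as a list of rows."""
    n_rows = len(labels)
    n_labels = len(labels[0])
    return [sum(row[j] for row in labels) / n_rows for j in range(n_labels)]


toy_labels = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
print(binary_prior(toy_labels))  # [0.25, 0.5, 0.75]
```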

In [13]:
CelebaCached_dl.identity # Attributes in the identity.txt file

tensor([[2880],
        [2937],
        [8692],
        ...,
        [7391],
        [8610],
        [2304]])

In [14]:
CelebaCached_dl.filename

['000001.jpg',
 '000002.jpg',
 '000003.jpg',
 '000004.jpg',
 '000005.jpg',
 '000006.jpg',
 '000007.jpg',
 '000008.jpg',
 '000009.jpg',
 '000010.jpg',
 '000011.jpg',
 '000012.jpg',
 '000013.jpg',
 '000014.jpg',
 '000015.jpg',
 '000016.jpg',
 '000017.jpg',
 '000018.jpg',
 '000019.jpg',
 '000020.jpg',
 '000021.jpg',
 '000022.jpg',
 '000023.jpg',
 '000024.jpg',
 '000025.jpg',
 '000026.jpg',
 '000027.jpg',
 '000028.jpg',
 '000029.jpg',
 '000030.jpg',
 '000031.jpg',
 '000032.jpg',
 '000033.jpg',
 '000034.jpg',
 '000035.jpg',
 '000036.jpg',
 '000037.jpg',
 '000038.jpg',
 '000039.jpg',
 '000040.jpg',
 '000041.jpg',
 '000042.jpg',
 '000043.jpg',
 '000044.jpg',
 '000045.jpg',
 '000046.jpg',
 '000047.jpg',
 '000048.jpg',
 '000049.jpg',
 '000050.jpg',
 '000051.jpg',
 '000052.jpg',
 '000053.jpg',
 '000054.jpg',
 '000055.jpg',
 '000056.jpg',
 '000057.jpg',
 '000058.jpg',
 '000059.jpg',
 '000060.jpg',
 '000061.jpg',
 '000062.jpg',
 '000063.jpg',
 '000064.jpg',
 '000065.jpg',
 '000066.jpg',
 '000067.j

## 2 - Training in a Multi-label fashion

This time, we do not have binary labels (attractive or not, brown hair or not, ...), but rather a label that can take one of several values (hair type: brown, blond, black, bald).
Thus, the prior for each class is $\frac{1}{n_{classes}}$.

In `CelebaCached`, the priors are no longer scalar values (probabilities between 0 and 1, initially the mean over all the datapoints), but rather `Categorical` distributions that hold a probability for each class $c \in \{1, \dots, n_{classes}\}$.

To get this part running, you have to make the following changes:
- Use *multiclass_dataset_cached.py* instead of *dataset_cached.py*
- Change the *celeba.py* file of the torchvision library (the `__init__` function and the `_load_csv` function, both given below)

**Note that even though this is adapted for multi-class, it should also work on the original binary dataset.**

### **IMPORTANT:**
- The multi-class labels must be named 'classname_MULTI' in the dataset (*.txt* files); otherwise they will be transformed as if they were binary classes
- Again, don't forget to pass the `multi_class=True` keyword argument to the `setup_data_loaders` function
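To see why the naming matters, here is a plain-Python sketch of the remapping rule used in the modified `__init__`: columns whose name does not contain `MULTI` are mapped from {-1, 1} to {0, 1}, the others are left untouched. The helper name and the toy rows are made up for illustration:

```python
def remap_binary_columns(attr_names, rows):
    """Map {-1, 1} to {0, 1} in every column whose name lacks 'MULTI'."""
    out = [list(r) for r in rows]  # work on a copy
    for j, name in enumerate(attr_names):
        if "MULTI" not in name:
            for r in out:
                r[j] = (r[j] + 1) // 2  # floor division, as in the tensor code
    return out


names = ["Smiling", "Hair_MULTI"]
rows = [[-1, 3], [1, 0], [1, 4]]
print(remap_binary_columns(names, rows))  # [[0, 3], [1, 0], [1, 4]]

# Without the suffix, a class index is silently altered: 3 becomes 2
print(remap_binary_columns(["Hair"], [[3]]))  # [[2]]
```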

\_\_init\_\_ function **COPY THIS**:
```
    def __init__(
        self,
        root: str,
        split: str = "train",
        target_type: Union[List[str], str] = "attr",
        transform: Optional[Callable] = None,
        target_transform: Optional[Callable] = None,
        download: bool = False,
        check_itr = True,
        multi_class = False
    ) -> None:
        super().__init__(root, transform=transform, target_transform=target_transform)
        self.split = split
        if isinstance(target_type, list):
            self.target_type = target_type
        else:
            self.target_type = [target_type]

        if not self.target_type and self.target_transform is not None:
            raise RuntimeError("target_transform is specified but target_type is empty")

        if download:
            self.download()

        if(check_itr):
            if not self._check_integrity():
                raise RuntimeError("Dataset not found or corrupted. You can use download=True to download it")

        split_map = {
            "train": 0,
            "valid": 1,
            "test": 2,
            "all": None,
        }
        split_ = split_map[verify_str_arg(split.lower(), "split", ("train", "valid", "test", "all"))]

        if(multi_class):
            print('Loading MULTI_CLASS dataset')
            attr = self._load_csv("list_attr_multiclass_celeba.txt", header=1)
            splits = self._load_csv("list_eval_partition_multiclass.txt")
            identity = self._load_csv("identity_multiclass_CelebA.txt")
            bbox = self._load_csv("list_bbox_multiclass_celeba.txt", header=1)
            landmarks_align = self._load_csv("list_landmarks_align_multiclass_celeba.txt", header=1)

        else:
            print('Loading original binary classes dataset')
            attr = self._load_csv("list_attr_celeba.txt", header=1)
            splits = self._load_csv("list_eval_partition.txt")
            identity = self._load_csv("identity_CelebA.txt")
            bbox = self._load_csv("list_bbox_celeba.txt", header=1)
            landmarks_align = self._load_csv("list_landmarks_align_celeba.txt", header=1)

        mask = slice(None) if split_ is None else (splits.data == split_).squeeze()

        if mask == slice(None):  # if split == "all"
            self.filename = splits.index
        else:
            self.filename = [splits.index[i] for i in torch.squeeze(torch.nonzero(mask))]
        self.identity = identity.data[mask]
        self.bbox = bbox.data[mask]
        self.landmarks_align = landmarks_align.data[mask]
        self.attr = attr.data[mask]
        self.attr_names = attr.header
        # map from {-1, 1} to {0, 1} if not multi-class label
        for i, lab in enumerate(self.attr_names):
            if('MULTI' not in lab): # binary label: remap it (multi-class labels are left untouched)
                self.attr[:, i] = torch.div(self.attr[:, i] + 1, 2, rounding_mode="floor")
```

Here we show how to build and access the priors as `Categorical` distributions, starting from a uniform distribution over the classes.

In [15]:
CELEBA_MULTI_LABELS = {'Hair': 5} # Multi-label: has 5 possible classes

prior = []
dict_index_class = {}
i=0

for label, n_classes in CELEBA_MULTI_LABELS.items():
    uniform_distrib = [1/n_classes for i in range(n_classes)]

    uniform_tensor = torch.tensor(uniform_distrib).unsqueeze(0)

    prior.append(torch.distributions.categorical.Categorical(probs= uniform_tensor))
    dict_index_class[i] = label
    i+=1

dict_class_index = {v:k for k,v in dict_index_class.items()}
prior[0].probs[0][0] # First categorical distribution, row 1 (always), n_classes probas

tensor(0.2000)

And here is how we compute it in the \_\_init\_\_ function of the *multiclass_dataset_cached.py* file. In line with the original method, we compute the empirical probability of each class of a given label in the dataset.

In [16]:
# Toy dataset
tensor = torch.tensor([
    [0, 1, 2],
    [1, 1, 1],
    [0, 1, 0],
    [1, 0, 1]
], dtype=torch.float32)

# Toy dict, index:nb_classes
dict_test = {0: 2, 1:2, 2:2}

probs_prior = []
data_size = tensor.shape[0] # Dataset size
for lab_idx, nclass in dict_test.items():
    # computing probs_prior for label lab_idx
    probs_lab = [] 
    for j in range(nclass):
        probs_lab.append(torch.sum(tensor[:, lab_idx] == j).item()/data_size) # Counting the nb of appearances of each class, divided by total samples
    probs_prior.append(probs_lab)

print(probs_prior)
prior = [torch.distributions.categorical.Categorical(probs= torch.tensor(lab_prior).unsqueeze(0)) for lab_prior in probs_prior]
for p in prior:
    print(p.probs)

[[0.5, 0.5], [0.25, 0.75], [0.25, 0.5]]
tensor([[0.5000, 0.5000]])
tensor([[0.2500, 0.7500]])
tensor([[0.3333, 0.6667]])
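The last row deserves a comment: the third label of the toy dataset takes the values 0, 1 and 2, but the toy dictionary declares only 2 classes for it, so the counted frequencies `[0.25, 0.5]` sum to 0.75. `Categorical` normalizes `probs` so that they sum to 1, which gives `[0.3333, 0.6667]`. A dependency-free sketch of both effects (the helper names are ours):

```python
def class_frequencies(column, n_classes):
    """Empirical frequency of each class index 0..n_classes-1 in `column`."""
    return [sum(1 for v in column if v == c) / len(column) for c in range(n_classes)]


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


third_label = [2, 1, 0, 1]  # third column of the toy dataset
print(class_frequencies(third_label, 2))  # [0.25, 0.5]: class 2 is ignored
print(class_frequencies(third_label, 3))  # [0.25, 0.5, 0.25]
print(normalize(class_frequencies(third_label, 2)))  # roughly [0.333, 0.667]
```

Declaring the right number of classes per label avoids this silent renormalization.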


_load_csv function **COPY THIS**:
```
def _load_csv(
    self,
    filename: str,
    header: Optional[int] = None,
) -> CSV:
    with open(os.path.join(self.root, self.base_folder, filename)) as csv_file:
        data = list(csv.reader(csv_file, delimiter=" ", skipinitialspace=True))

    if header is not None:
        headers = data[header]
        data = data[header + 1 :]
    else:
        headers = []

    indices = [row[0] for row in data]
    data = [row[1:] for row in data]
    # Delete rows with nans
    indices_int = []
    data_int = []
    for i, row in enumerate(data):
        try:
            data_int.append(list(map(int,map(float,row))))
            indices_int.append(indices[i])
        except Exception as e:
            pass # If error (nan), we discard the row

    return CSV(headers, indices_int, torch.tensor(data_int))
```

In [17]:
# This function is slightly modified to work here, but it is an example of how the modified function works
CSV = namedtuple("CSV", ["header", "index", "data"])

def load_csv(path,
    header: Optional[int] = None,
) -> CSV:
    with open(path) as csv_file:
        data = list(csv.reader(csv_file, delimiter=" ", skipinitialspace=True))

    if header is not None:
        headers = data[header]
        data = data[header + 1 :]
    else:
        headers = []
    # Remove lines containing NaN
    indices = [row[0] for row in data]
    data = [row[1:] for row in data]
    indices_int = []
    data_int = []
    for i, row in enumerate(data):
        try:
            data_int.append(list(map(int,map(float,row))))
            indices_int.append(indices[i])
        except Exception as e:
            pass

    return CSV(headers, indices_int, torch.tensor(data_int))

Size of all dataset: 202599

Size of data set without Nan: 124892

In [18]:
# Try with the original dataset to see if something has changed
csv_attr = load_csv("./data/datasets/celeba/celeba/list_attr_celeba.txt", header=1)
len(csv_attr[1])

202599

In [20]:
# Try with multiclass dataset to see if size is correct
csv_attr = load_csv("./data/datasets/celeba/celeba/list_attr_multiclass_celeba.txt", header=1)
len(csv_attr[1])

124892
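The gap between the two sizes (202599 - 124892 = 77707 rows) is presumably made of rows that fail the integer conversion, which is how `load_csv` discards NaN entries. The same filtering can be tried on a small in-memory sample; the helper name `parse_rows` and the three sample rows are made up for illustration:

```python
import csv
import io


def parse_rows(text):
    """Parse space-separated rows, dropping any row with a non-integer value."""
    reader = csv.reader(io.StringIO(text), delimiter=" ", skipinitialspace=True)
    kept_ids, kept_vals = [], []
    for row in reader:
        try:
            # int(float("nan")) raises ValueError, so NaN rows end up here
            vals = [int(float(v)) for v in row[1:]]
        except ValueError:
            continue  # discard the whole row
        kept_ids.append(row[0])
        kept_vals.append(vals)
    return kept_ids, kept_vals


sample = "000001.jpg 1 -1\n000002.jpg nan 1\n000003.jpg 0  2\n"
ids, vals = parse_rows(sample)
print(ids)   # ['000001.jpg', '000003.jpg']
print(vals)  # [[1, -1], [0, 2]]
```

Catching only `ValueError` instead of every `Exception` keeps unrelated errors visible while still dropping NaN rows.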

Load the multiclass dataset

In [22]:
loaders = mdc.setup_data_loaders(False, 1, root="./data/datasets/celeba", download=False, multi_class=True)

Loading MULTI_CLASS dataset
Splitting Dataset
Loading MULTI_CLASS dataset
Loading MULTI_CLASS dataset


In [23]:
loaders

{'sup': <torch.utils.data.dataloader.DataLoader at 0x7fd2058880a0>,
 'test': <torch.utils.data.dataloader.DataLoader at 0x7fd1e8b6baf0>,
 'valid': <torch.utils.data.dataloader.DataLoader at 0x7fd1c85ed040>}

In [24]:
for l in loaders:
    dataloader = loaders[l]
    print(f'\nDatastet Name {l}')
    print(f'Characteristics {dataloader.dataset}')

    print(f'Elements of dataset are {type(dataloader.dataset[0])} of shape {len(dataloader.dataset[0])}:')
    print(f'\tElt 1 is a {type(dataloader.dataset[0][0])} of shape {(dataloader.dataset[0][0]).shape}')
    print(f'\tElt 2 is a {type(dataloader.dataset[0][1])} of shape {(dataloader.dataset[0][1]).shape}')



Datastet Name sup
Characteristics Dataset CELEBACached
    Number of datapoints: 80247
    Root location: ./data/datasets/celeba
    Target type: ['attr']
    Split: train
Elements of dataset are <class 'tuple'> of shape 2:
	Elt 1 is a <class 'torch.Tensor'> of shape torch.Size([3, 64, 64])
	Elt 2 is a <class 'torch.Tensor'> of shape torch.Size([1])

Datastet Name test
Characteristics Dataset CELEBACached
    Number of datapoints: 12186
    Root location: ./data/datasets/celeba
    Target type: ['attr']
    Split: test
Elements of dataset are <class 'tuple'> of shape 2:
	Elt 1 is a <class 'torch.Tensor'> of shape torch.Size([3, 64, 64])
	Elt 2 is a <class 'torch.Tensor'> of shape torch.Size([1])

Datastet Name valid
Characteristics Dataset CELEBACached
    Number of datapoints: 20000
    Root location: ./data/datasets/celeba
    Target type: ['attr']
    Split: train
Elements of dataset are <class 'tuple'> of shape 2:
	Elt 1 is a <class 'torch.Tensor'> of shape torch.Size([3, 64, 64])

With multi-class labels we get almost the same loaders, with a few differences:
- The label tensors have the size of `CELEBA_MULTI_LABELS` (versus 18 for the original `CELEBA_EASY_LABELS`). **Be sure to add the labels you want to train on to this dictionary. WARNING: the hair-colour labels have been removed for multi-class (merged into a single *Hair_MULTI* label)**
- The prior is no longer a `tensor((1, n_labels), dtype=float)` but a **list of categorical distributions L**. For one multi-class label, the prior can be accessed with `L[label_idx].probs[0]` (the `probs` object is itself a tensor of size $(1, n_{classes})$, so indexing with `[0]` returns its single row). In that sense the architecture is label-agnostic: binary labels also get a categorical distribution with 2 classes, that is, a Bernoulli distribution.
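The equivalence between a binary prior and a 2-class categorical can be written out with plain lists. This is only an illustration of the data layout; the helper name and the example values are assumptions, and the real code stores `Categorical` objects rather than lists:

```python
def bernoulli_as_categorical(p):
    """Two-class probability row [P(class 0), P(class 1)] for a Bernoulli(p)."""
    return [1.0 - p, p]


# Two binary labels followed by one 5-class label with a uniform prior
priors = [bernoulli_as_categorical(p) for p in (0.25, 0.75)]
priors.append([1 / 5] * 5)

label_idx = 1
print(priors[label_idx])  # [0.25, 0.75]
print(len(priors[2]))     # 5
```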

When dealing with multi-class labels, use the `setup_data_loaders` function of the *multiclass_dataset_cached.py* file.