## CVPR2021 NAS track1 stage2 3rd Solution Illustration
Brought to you by **纳萨力克地下大坟墓 (Great Tomb of Nazarick)** from **Meta_Learners**, **THUMNLab**, **Tsinghua University**

**Team Member**: Chaoyu Guan (关超宇) (**Leader**), Yijian Qin (秦一鉴), Zhikun Wei (卫志坤), Zeyang Zhang (张泽阳), Zizhao Zhang (张紫昭), Xin Wang (王鑫)
### Table of contents
- [Train](#I)
- [Evaluate](#II)
- [Tuning](#III)

### <a id="I">Part I. Train </a>
Before training, prepare the datasets under the `./data` folder. First download the CIFAR-100 dataset from the [competition website](https://aistudio.baidu.com/aistudio/datasetdetail/76994) and move it to `./data/cifar-100-python.tar.gz`, then download the 50,000 (5w) architectures to be submitted from the [competition website](https://aistudio.baidu.com/aistudio/datasetdetail/73326) and save them to `./data/Track1_final_archs.json`.

Then, we need to import the necessary modules.

In [None]:
import json
import pickle
from tqdm import tqdm

import paddle
import paddle.nn as nn
import paddle.nn.functional as F
import paddle.optimizer as opt
from paddle.optimizer.lr import CosineAnnealingDecay, LinearWarmup

from supernet.utils import seed_global, Dataset, str2arch, get_param
from supernet.sample import Generator, strict_fair_sample, uniform_sample
from supernet.super.supernet import Supernet
from supernet.super import super_bn, super_conv, super_fc

# BUG: seeding PaddlePaddle does not make training fully deterministic. Even with the seed set, the training procedure still has randomness.
# Running this notebook several times gives different supernet performance, even though paddle.seed(0) is called inside seed_global.
seed_global(0)

Then, we can build a supernet using shared conv, bn and fc modules, and load the data needed for training.

In [None]:
supernet = Supernet(super_bn.BestBN, super_conv.BestConv, super_fc.BestFC)
dataloader = Dataset().get_loader(batch_size=128, mode='train', num_workers=4)

Next, we train the supernet following the paper ["Universally Slimmable Networks and Improved Training Techniques"](https://arxiv.org/abs/1903.05134). The basic idea is that, for each data batch, we randomly sample $n - 2$ architectures and add two fixed ones: the biggest architecture and the smallest architecture. The biggest architecture is trained with the ground-truth labels of the batch. The other architectures are trained with knowledge distillation, using the biggest architecture as the teacher.

The intuition is that the biggest and smallest architectures bound the performance of the search space from above and below, so optimizing both should improve overall performance.
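Under the scheme described above, the per-batch objective can be sketched as follows. The notation is introduced here for illustration only: $f_a$ is the supernet evaluated with architecture $a$, $a_{\max}$ and $a_{\min}$ are the biggest and smallest architectures, $\mathcal{A}$ is the set of $n - 2$ sampled architectures, and $\operatorname{sg}[\cdot]$ means the teacher output is treated as a constant (no gradient flows through it).

$$
\mathcal{L}(x, y) = \frac{1}{n}\left[\mathrm{CE}\big(f_{a_{\max}}(x),\, y\big) + \sum_{a \in \mathcal{A} \cup \{a_{\min}\}} \mathrm{CE}\big(f_{a}(x),\, \operatorname{sg}[\mathrm{softmax}(f_{a_{\max}}(x))]\big)\right]
$$

The first term uses hard labels, and each of the remaining $n - 1$ terms uses the teacher distribution as a soft label.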

To get a better rank, we use basically the same training hyperparameters as the stand-alone training procedure of each sub-architecture, and we sample the $n - 2$ architectures with a channel-wise fair sampling method borrowed from ["FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search"](https://arxiv.org/abs/1907.01845). The basic idea is that, for every $m\times k$ architectures sampled, every channel choice in a given layer appears exactly $m$ times, where $k$ is the number of choices available to that layer.
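The fairness property above can be sketched with one shuffled pool of choices per layer that is refilled only when it runs empty. This is a minimal illustration under that assumption; `FairSampler` is a hypothetical name, and the actual `strict_fair_sample` in `supernet.sample` may be implemented differently.

```python
import random

# Same per-layer channel choices as the SPACES constant in the training cell.
SPACES = (
    [list(range(4, 17, 4))] * 7
    + [list(range(4, 33, 4))] * 6
    + [list(range(4, 65, 4))] * 7
)


class FairSampler:
    """Hypothetical channel-wise fair sampler (illustration only)."""

    def __init__(self, spaces, seed=None):
        self.spaces = spaces
        self.rng = random.Random(seed)
        # One independent pool per layer; pools start empty.
        self.pools = [[] for _ in spaces]

    def __call__(self):
        arch = []
        for pool, choices in zip(self.pools, self.spaces):
            if not pool:
                # Refill with every choice exactly once, in random order.
                pool.extend(choices)
                self.rng.shuffle(pool)
            arch.append(pool.pop())
        return arch


sampler = FairSampler(SPACES, seed=0)
batch = [sampler() for _ in range(16)]
```

With 16 consecutive samples, a layer with $k = 4$ choices sees each choice 4 times, and a layer with $k = 16$ choices sees each choice once.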

**NOTE: The following cell takes about 10 hours to train the supernet on one Tesla V100.**

In [None]:
# Hyper Parameters
EPOCH = 300
LR = 0.1
WEIGHT_DECAY = 0.0     # we find that weight decay 0 results in better ranks. See Part III. Tuning
GRAD_CLIP = 5          # we find that gradient clipping results in slightly better ranks. See Part III. Tuning
SPACES = (
    [[4, 8, 12, 16]] * 7
    + [[4, 8, 12, 16, 20, 24, 28, 32]] * 6
    + [[4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]] * 7
)
N = 18                 # We choose n=18 instead of the small n used in the paper, which we find more suitable for this competition. See Part III. Tuning
SKIP_CONNECT = True    # We find that turning on skip connections results in better ranks. See Part III. Tuning

# build an architecture generator sampled according to channel-wise fair sampling
generator = Generator(strict_fair_sample)
max_arch = [max(x) for x in SPACES]
min_arch = [min(x) for x in SPACES]

# build the optimizer
optimizer = opt.Momentum(
    LinearWarmup(CosineAnnealingDecay(LR, EPOCH), 2000, 0.0, LR),
    momentum=0.9,
    parameters=supernet.parameters(),
    weight_decay=WEIGHT_DECAY,
    grad_clip=None if GRAD_CLIP <= 0 else nn.ClipGradByGlobalNorm(GRAD_CLIP),
)
optimizer.clear_grad()

for e in tqdm(range(EPOCH), desc="total"):
    for data in tqdm(dataloader, desc=f"{e} epoch"):
        # sample archs
        archs = [generator() for _ in range(N - 2)] + [min_arch]
        
        # use ground truth label to train the biggest model.
        logit_big = supernet(data[0], max_arch, SKIP_CONNECT)
        loss = F.cross_entropy(logit_big, data[1]) / N
        loss.backward()

        # use output of big model to knowledge distill other models
        with paddle.no_grad():
            distribution = F.softmax(logit_big).detach()
        for arch in archs:
            logit = supernet(data[0], arch, SKIP_CONNECT)
            loss = F.cross_entropy(logit, distribution, soft_label=True) / N
            loss.backward()

        # update parameters of supernet
        optimizer.step()
        optimizer.clear_grad()

        # update scheduler
        if optimizer._learning_rate.last_epoch < optimizer._learning_rate.warmup_steps:
            optimizer._learning_rate.step()
        
    # save the model for this epoch (once per epoch, after the last batch)
    paddle.save(supernet.state_dict(), f'./saved_models/{e}-model.pdparams')

    # update scheduler
    if optimizer._learning_rate.last_epoch >= optimizer._learning_rate.warmup_steps:
        optimizer._learning_rate.step()
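The learning-rate schedule in the training cell above steps once per batch during warmup and once per epoch afterwards. The two phases can be sketched with the standard closed-form formulas below. This is a simplified illustration that assumes a linear ramp from 0 to `LR` and a cosine decay to 0; `warmup_lr` and `cosine_lr` are hypothetical helper names, not Paddle APIs.

```python
import math

LR, EPOCH, WARMUP_STEPS = 0.1, 300, 2000


def warmup_lr(step):
    """Linear warmup: learning rate after `step` warmup iterations."""
    return LR * step / WARMUP_STEPS


def cosine_lr(t):
    """Cosine annealing: learning rate after `t` post-warmup scheduler steps."""
    return 0.5 * LR * (1 + math.cos(math.pi * t / EPOCH))


for step in (0, 1000, 2000):
    print(f"warmup step {step}: lr = {warmup_lr(step):.4f}")
for t in (0, 150, 300):
    print(f"cosine step {t}: lr = {cosine_lr(t):.4f}")
```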

After training, we can evaluate any of the trained supernets saved under the `./saved_models` folder.

### <a id="II">Part II. Evaluate </a>
For evaluation, we simply extract the parameters (excluding the batch norm running statistics) from the supernet to form the parameters of each single model. The batch norm statistics are calculated on the fly from each test-set data batch. We find that this technique results in better ranks. See [Part III. Tuning](#III) for more details.

We also find that scoring each single model by its negative cross-entropy loss on the test dataset gives a ranking that agrees better with the ground-truth accuracy of the models.
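The two candidate scores can be contrasted on a toy example. One possible intuition, offered here only as an illustration, is that the loss is continuous and can separate models that have the same top-1 accuracy. The helper names and the toy logits below are made up for this sketch.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def neg_cross_entropy(all_logits, labels):
    """Score = minus the summed cross-entropy, so higher is better."""
    return sum(log_softmax(l)[y] for l, y in zip(all_logits, labels))


def top1_accuracy(all_logits, labels):
    """Fraction of samples whose arg-max logit matches the label."""
    hits = sum(
        max(range(len(l)), key=l.__getitem__) == y
        for l, y in zip(all_logits, labels)
    )
    return hits / len(labels)


labels = [0, 1]
model_a = [[2.0, 0.0], [0.0, 2.0]]  # correct with a large margin
model_b = [[0.1, 0.0], [0.0, 0.1]]  # correct with a tiny margin

score_a = neg_cross_entropy(model_a, labels)
score_b = neg_cross_entropy(model_b, labels)
```

Both toy models have the same top-1 accuracy, but `score_a` is higher than `score_b`, so the loss-based score still orders them.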

**NOTE: The cell below takes about 8 hours on one Tesla V100.**

In [None]:
SKIP_CONNECT = True

with paddle.no_grad():
    # build model, load best parameters
    supernet = Supernet(super_bn.BestBN, super_conv.BestConv, super_fc.BestFC)

    '''
    # use the line below to generate the 0.97122 result
    # supernet.set_state_dict(paddle.load('./saved_models/model_97122.pdparams'))
    
    # use the line below to generate the 0.97131 result
    # supernet.set_state_dict(paddle.load('./saved_models/model_97131.pdparams'))
    
    # the 97131 and 97122 models are two models rerun following the notebook instructions above
    '''
    supernet.set_state_dict(paddle.load('./saved_models/255-model.pdparams'))

    supernet.eval()
    dataloader = Dataset().get_loader(batch_size=512, mode='test', num_workers=4)

    architecture = json.load(open('./data/Track1_final_archs.json', 'r'))
    for key in tqdm(architecture, desc="all archs"):
        arch = architecture[key]["arch"]
        arch = str2arch(arch)

        loss = 0.
        for data in tqdm(dataloader, desc="test dataset"):
            logit = supernet.inference(data[0], arch, SKIP_CONNECT)
            loss += float(F.cross_entropy(logit, data[1], reduction='sum'))
        architecture[key]["acc"] = - loss

# normalize architecture accs
max_acc = max([architecture[key]['acc'] for key in architecture])
min_acc = min([architecture[key]['acc'] for key in architecture])

for key in architecture:
    architecture[key]['acc'] = (architecture[key]['acc'] - min_acc) / (max_acc - min_acc)

# save the results
json.dump(architecture, open("./saved_models/result_5w.json", "w"))

The resulting `./saved_models/result_5w.json` is the file we submitted online.

### <a id="III">Part III. Tuning</a>

In this part, we describe what we explored during the competition and share our findings.

NOTE: since most of our results were obtained with different versions of the code, backends and structures, we __only provide the general findings__ and __how to run these comparison experiments__. We do not report exact rank scores, since most of them were not produced with the newest version of our code.

#### III.a Influence of different sharing strategies

The first thing we tested is how to share each part of the supernet. We can choose whether to share the fc layer, the bn layers and the conv layers.
- For the fc layer, there are two choices:
    - When not shared, every choice of last-layer channel number uses a different fc layer. [__IndependentFC__]
    - When shared, the fc layer for a smaller channel number is derived from the biggest fc layer by taking the $c$ channels with the smallest indices, where $c$ is the smaller channel number. [__FullFC__]
- For conv layers, there are 4 ways to share:
    - Independent. For one layer, every choice of [in_c, out_c] uses an independent kernel with shape [in_c, out_c]. [__IndependentConv__]
    - Front share. For one layer, conv ops with the same input channel number share the same kernel, with the smaller kernel extracted from the shared kernel by choosing the neurons with small ids. [__FrontShareConv__]
    - End share. For one layer, conv ops with the same output channel number share the same kernel, with the smaller kernel extracted from the shared kernel by choosing the neurons with small ids. [__EndShareConv__]
    - Full share. Every op in one layer is a slice of a single shared kernel, taken by choosing the small ids. [__FullConv__]
- For bn layers, there are also 4 ways to share, similar to the conv layers:
    - Independent. Every choice of [in_c, out_c] uses an independent bn op of shape [out_c]. [__IndependentBN__]
    - Front share. For one layer, bn ops with the same conv input channel number are shared. [__FrontShareBN__]
    - End share. For one layer, bn ops with the same conv output channel number are shared. [__EndShareBN__]
    - Full share. The same bn is used for every op choice. [__FullBN__]
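The four conv sharing modes differ in which stored kernel a given [in_c, out_c] choice reads from. A minimal sketch of that lookup, under the reading of the list above, is shown below. `kernel_lookup` and `count_kernels` are hypothetical helpers written for this illustration; they are not part of the `supernet` package, and only the two channel dimensions are modelled.

```python
from itertools import product


def kernel_lookup(strategy, in_c, out_c, max_in, max_out):
    """Return (key of the stored kernel, [in, out] shape of that stored kernel)."""
    if strategy == "independent":
        # One kernel per (in_c, out_c) pair, used without slicing.
        return (in_c, out_c), (in_c, out_c)
    if strategy == "front":
        # Shared among ops with the same in_c, sliced along the output axis.
        return (in_c, None), (in_c, max_out)
    if strategy == "end":
        # Shared among ops with the same out_c, sliced along the input axis.
        return (None, out_c), (max_in, out_c)
    if strategy == "full":
        # One kernel per layer, sliced along both axes.
        return (None, None), (max_in, max_out)
    raise ValueError(f"unknown strategy: {strategy}")


def count_kernels(strategy, in_choices, out_choices):
    """Number of distinct stored kernels one layer needs under a strategy."""
    keys = {
        kernel_lookup(strategy, i, o, max(in_choices), max(out_choices))[0]
        for i, o in product(in_choices, out_choices)
    }
    return len(keys)


choices = [4, 8, 12, 16]
counts = {s: count_kernels(s, choices, choices)
          for s in ("independent", "front", "end", "full")}
print(counts)
```

For a layer with 4 input and 4 output choices, this gives 16 stored kernels for independent, 4 for front share, 4 for end share and 1 for full share.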

There are too many cross combinations, so we __do not enumerate__ the whole space. Instead, we __start from the fully shared supernet (FullConv, FullBN, FullFC)__ and test whether the rank improves when we change the sharing type of bn, conv and fc. In addition, we test the combinations that pair the matching _xxxConv_ and _xxxBN_, since they follow the same sharing logic.

In total, we test the following combinations:
1. FullConv, FullBN, FullFC
2. IndependentConv, FullBN, FullFC
3. FrontShareConv, FullBN, FullFC
4. EndShareConv, FullBN, FullFC
5. FullConv, IndependentBN, FullFC
6. FullConv, FrontShareBN, FullFC
7. FullConv, EndShareBN, FullFC
8. FullConv, FullBN, IndependentFC
9. FrontShareConv, FrontShareBN, FullFC
10. EndShareConv, EndShareBN, FullFC
11. IndependentConv, IndependentBN, FullFC

You can run each combination with the following command, replacing xxxconv, xxbn and xxfc with the corresponding classes.

In [None]:
!python -m supernet.scripts.train --convclass xxxconv --bnclass xxbn --fcclass xxfc

Conclusion: __Fully shared supernet performs best__.

#### III.b Sample strategy

The second exploration is how to sample architectures to train the supernet. We mainly explore 3 strategies.

- Uniform sample. Sample architectures uniformly.
- Fair sample. Sample fairly, following ["FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search"](https://arxiv.org/abs/1907.01845).
- Sample according to architecture parameter size. Architectures with more parameters have a higher probability of being sampled.
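The third strategy amounts to drawing from a fixed pool with probabilities proportional to parameter count. A minimal sketch using only the standard library is shown below. `param_proxy` is a hypothetical stand-in for a real parameter count (it is not the repository's `get_param`), and the three toy architectures are made up for illustration.

```python
import random


def param_proxy(arch):
    """Hypothetical size proxy: sum of in_c * out_c over consecutive layers."""
    return sum(a * b for a, b in zip(arch[:-1], arch[1:]))


def build_sampler(archs, seed=None):
    """Return (probabilities, sample function) for size-weighted sampling."""
    weights = [param_proxy(a) for a in archs]
    total = sum(weights)
    probs = [w / total for w in weights]
    rng = random.Random(seed)

    def sample():
        # random.choices draws with replacement according to the weights.
        return rng.choices(archs, weights=probs, k=1)[0]

    return probs, sample


archs = [[4, 4, 4], [16, 16, 16], [8, 8, 8]]
probs, sample = build_sampler(archs, seed=0)
drawn = [sample() for _ in range(5)]
```

In this toy pool the widest architecture gets the largest probability and the narrowest gets the smallest.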

Use the following commands to explore the sample strategies.

In [None]:
# first, generate the archs for `sampling according to architecture parameter size`
archs = uniform_sample(1000000)
arch_param = [get_param(a)[1] / 1000000 for a in archs]
sums = sum(arch_param)
arch_prob = [x / sums for x in arch_param]

pickle.dump(archs, open('./data/arch_1m.bin', 'wb'))
pickle.dump(arch_prob, open('./data/arch_1m_prob.bin', 'wb'))

In [None]:
!python -m supernet.scripts.train --sample uniform
!python -m supernet.scripts.train --sample fair
!python -m supernet.scripts.train --sample fixed --train_arch ./data/arch_1m.bin --train_arch_prob ./data/arch_1m_prob.bin

Conclusion: fair sample performs slightly better than the other strategies.

#### III.c Sandwich rule, knowledge distillation
To test whether the sandwich rule and knowledge distillation are effective, we explore using each of them. You can train the corresponding supernets by running the following code.

In [None]:
# use sandwich rule and knowledge distillation
!python -m supernet.scripts.train --n 6 --kd --sandwich
# use sandwich rule only
!python -m supernet.scripts.train --n 6 --sandwich
# do not use any
!python -m supernet.scripts.train --n 6

# use small n to rerun
!python -m supernet.scripts.train --n 3 --kd --sandwich
!python -m supernet.scripts.train --n 3 --sandwich
!python -m supernet.scripts.train --n 3

Conclusion: the sandwich rule is only effective when knowledge distillation is on and $n$ is large (greater than $5$). Sandwich rule + knowledge distillation results in a higher Kendall score, while using neither results in a higher Pearson score.
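For reference, the two correlation scores mentioned in the conclusion can be computed as in the sketch below. It uses Kendall tau-a (tied pairs contribute zero) and the plain Pearson coefficient; the exact variants used by the competition scorer are not stated in this notebook, and the toy numbers are made up.

```python
import math
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


predicted = [0.10, 0.40, 0.35, 0.80]  # supernet scores (toy values)
truth = [0.70, 0.74, 0.75, 0.79]      # ground-truth accuracies (toy values)
tau = kendall_tau(predicted, truth)
rho = pearson(predicted, truth)
```

Kendall only looks at the order of each pair, while Pearson also depends on how linearly the score values track the accuracies, so the two can favor different training setups.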

#### III.d Hyper parameters
We also tune the learning rate, weight decay, gradient clipping, number of training epochs, etc. You can use the following command to run these experiments.

In [None]:
!python -m supernet.scripts.train --weight_decay xxx --lr xxx --epoch xxx --clip xxx

Conclusion: weight decay should be 0, and gradient clip = 5 is better. The lr and epoch should stay the same as those used to train the ground-truth architectures: lr=0.1, epoch=300.

#### III.e N
The last main hyperparameter we explore during training is N.

In [None]:
!python -m supernet.scripts.train --n xxx --kd --sandwich

Conclusion: n=18 is the best

#### III.f Evaluation

We also explore the evaluation methods. There are mainly two aspects to explore.
- The metric to use
    - negative cross entropy loss [loss]
    - top1 accuracy [acc]
- The bn statistics to use
    - those in supernet during training [0]
    - calibrate them on training dataset [1]
    - calibrate them on test dataset [2]
    - calculate on the fly during testing [3]
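The bn modes above differ only in where the mean and variance come from. A minimal pure-Python sketch for a `[batch][channels]` input is shown below. `batch_norm` is a hypothetical helper for illustration, not the repository's implementation. Passing `mean` and `var` corresponds to modes 0 to 2 (stored or calibrated statistics), and omitting them corresponds to mode 3 (statistics of the current batch).

```python
import math


def batch_norm(batch, gamma, beta, mean=None, var=None, eps=1e-5):
    """Normalize each channel of a [batch][channels] input."""
    n, c = len(batch), len(batch[0])
    if mean is None or var is None:
        # Mode 3: compute the statistics from this batch on the fly.
        mean = [sum(row[j] for row in batch) / n for j in range(c)]
        var = [sum((row[j] - mean[j]) ** 2 for row in batch) / n
               for j in range(c)]
    return [
        [gamma[j] * (row[j] - mean[j]) / math.sqrt(var[j] + eps) + beta[j]
         for j in range(c)]
        for row in batch
    ]


batch = [[1.0, 10.0], [3.0, 30.0]]
gamma, beta = [1.0, 1.0], [0.0, 0.0]

on_the_fly = batch_norm(batch, gamma, beta)
# Statistics that do not match this sub-model's activations (toy values).
stale = batch_norm(batch, gamma, beta, mean=[0.0, 0.0], var=[1.0, 1.0])
```

With on-the-fly statistics every channel is centered for the current batch, whereas mismatched stored statistics leave the activations shifted and badly scaled.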

You can evaluate them using the following command:


In [None]:
!python -m supernet.scripts.evaluate --path path/to/pdparams --path_to_arch path/to/arch --metric loss or acc --bn_mode 0 or 1 or 2 or 3

Conclusion: loss is better than acc, and calibrating bn performs similarly to calculating on the fly. Directly using the supernet statistics performs worst. For speed, we use mode 3 in the end.

#### III.g Others

We've also tested some other strategies. They are less systematic or less important, so we omit them from this notebook; you can refer to `./supernet/scripts/train` for more details.