## Getting started with NAS

This notebook describes the basic steps needed to optimize an architecture using the NAS framework.

### Main chapters of this notebook:
1. Set up the environment
1. Prepare dataset and create dataloaders
1. Create model and move it into search space
1. Pretrain constructed search space
1. Search the best architecture
1. Tune model with the best architecture
1. Make model portable

## Set up the environment
First, let's set up the environment and make some common imports.

In [None]:
import os

os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
# You may need to uncomment and change this variable to match free GPU index
# os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [None]:
import sys

sys.path.append('../')

from functools import partial
from pathlib import Path

import torch
import torch.nn as nn
import numpy as np
from torch_optimizer import RAdam
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision.models.mobilenet import mobilenet_v2

from enot.autogeneration import TransformationParameters
from enot.autogeneration import generate_pruned_search_variants_model
from enot.latency import current_latency
from enot.latency import initialize_latency
from enot.latency import min_latency
from enot.latency import max_latency
from enot.latency import mean_latency
from enot.models import SearchSpaceModel
from enot.optimize import FixedLatencySearchOptimizer
from enot.optimize import PretrainOptimizer
from enot.optimize.search import EvolutionSearch
from enot.utils.benchmark import TorchBenchmarkRunner

from tutorial_utils.checkpoints import download_autogen_pretrain_checkpoint
from tutorial_utils.dataset import create_imagenette_dataloaders
from tutorial_utils.phases import tutorial_train_loop
from tutorial_utils.train import WarmupScheduler
from tutorial_utils.train import accuracy

### In the following cell we set up all necessary directories

* `HOME_DIR` - experiments home directory
* `DATASETS_DIR` - root directory for datasets (imagenette2, ...)
* `PROJECT_DIR` - project directory to save training logs, checkpoints, ...

In [None]:
HOME_DIR = Path.home() / '.optimization_experiments'
DATASETS_DIR = HOME_DIR / 'datasets'
PROJECT_DIR = HOME_DIR / 'getting_started'

HOME_DIR.mkdir(exist_ok=True)
DATASETS_DIR.mkdir(exist_ok=True)
PROJECT_DIR.mkdir(exist_ok=True)

## Prepare dataset and create dataloaders

We will use the Imagenette2 dataset in this example. <br>
The `create_imagenette_dataloaders` function prepares the datasets for you; specifically, it:
1. downloads and unpacks the dataset into `DATASETS_DIR`;
1. splits the dataset into 4 parts and saves the annotations of each part in `PROJECT_DIR` (we need this to prevent overfitting; for big datasets, you can use the usual train/val/test split);
1. creates dataloaders for every stage and returns them as a dictionary.

The four parts of the dataset:
* train: for pretrain and tuning stages (`PROJECT_DIR`/train.csv)
* validation: to choose checkpoint for architecture optimization (`PROJECT_DIR`/validation.csv)
* search: for architecture search stage (`PROJECT_DIR`/search.csv)
* test: hold-out data for testing (`PROJECT_DIR`/test.csv)
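
A minimal sketch of such a four-way split, assuming a flat list of `(filepath, label)` pairs. The helper name `split_four_ways`, the fractions and the seed are illustrative assumptions and are not necessarily what `create_imagenette_dataloaders` uses:

```python
import random


def split_four_ways(samples, fractions=(0.7, 0.1, 0.1, 0.1), seed=0):
    """Shuffle samples and split them into train/validation/search/test parts."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError('fractions must sum to 1')
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    names = ('train', 'validation', 'search', 'test')
    parts = {}
    start = 0
    cumulative = 0.0
    for i, (name, fraction) in enumerate(zip(names, fractions)):
        cumulative += fraction
        # The last part takes everything that is left, so no sample is dropped.
        end = len(shuffled) if i == len(names) - 1 else round(cumulative * len(shuffled))
        parts[name] = shuffled[start:end]
        start = end
    return parts


# Hypothetical file names and labels, used only to exercise the helper.
samples = [(f'img_{i}.jpeg', i % 10) for i in range(100)]
parts = split_four_ways(samples)
print({name: len(part) for name, part in parts.items()})
```

Because the parts are disjoint, the data used to score architectures during search never overlaps with the data used to train the weights.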

Available dataloaders in the dictionary:
* pretrain_train_dataloader:
    training dataloader for "pretrain" stage (using data from `PROJECT_DIR`/train.csv)

* pretrain_validation_dataloader:
    validation dataloader for "pretrain" stage (using data from `PROJECT_DIR`/validation.csv)

* search_train_dataloader:
    training dataloader for "search" stage (using data from `PROJECT_DIR`/search.csv)

* search_validation_dataloader:
    validation dataloader for "search" stage (using data from `PROJECT_DIR`/search.csv)

* tune_train_dataloader:
    training dataloader for "finetune" stage (same as pretrain_train_dataloader)

* tune_validation_dataloader:
    validation dataloader for "finetune" stage (same as pretrain_validation_dataloader)

**NOTE:**<br>
CSV annotations use the following format:
```
filepath,label
<relative_path_1>,<int_label_1>
<relative_path_2>,<int_label_2>
...
```
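
As a sketch of how annotations in this format can be read back with the standard library (the file content and the helper name `read_annotations` below are made up for illustration):

```python
import csv
import io

# Hypothetical annotation content in the `filepath,label` format.
csv_text = """filepath,label
train/n01440764/img_0001.JPEG,0
train/n02102040/img_0002.JPEG,1
"""


def read_annotations(file_obj):
    """Return a list of (relative_path, int_label) pairs from a CSV file object."""
    reader = csv.DictReader(file_obj)
    return [(row['filepath'], int(row['label'])) for row in reader]


annotations = read_annotations(io.StringIO(csv_text))
print(annotations)
```

For a real annotation file, pass an opened file (for example `open(path, newline='')`) instead of the `io.StringIO` object.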

In [None]:
dataloaders = create_imagenette_dataloaders(
    dataset_root_dir=DATASETS_DIR,
    project_dir=PROJECT_DIR,
    input_size=(224, 224),
    batch_size=8,
)

## Create model and move it into search space

Our architecture optimization procedure selects the best combination of operations from a user-defined search space. The easiest way to define one is to take a base architecture and add similar operations with different parameters: kernel size, expansion ratio, activation, etc. You can also add different kinds of operations or implement your own (see <span style="color:green;white-space:nowrap">***10. Tutorial - adding a custom operation for model builder***</span>).
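
To get a feeling for how quickly such a search space grows, here is a small framework-independent sketch. The operation names and the number of blocks are hypothetical; an architecture is represented as one option index per searchable block:

```python
import itertools
import math

# Hypothetical search space: one list of candidate operations per searchable block.
search_space_options = [
    ['mib_k3_e6', 'mib_k5_e6', 'mib_k7_e3', 'skip_conv'],
    ['mib_k3_e6', 'mib_k5_e6', 'mib_k7_e3', 'skip_conv'],
    ['dropout_0.0', 'dropout_0.25', 'dropout_0.5'],
]

# The number of distinct architectures is the product of the options per block.
n_architectures = math.prod(len(options) for options in search_space_options)
print(n_architectures)  # 4 * 4 * 3 = 48

# Every architecture is a tuple of indexes, one index per block.
all_indexes = list(itertools.product(*(range(len(o)) for o in search_space_options)))
first_arch = [options[i] for options, i in zip(search_space_options, all_indexes[0])]
print(all_indexes[0], first_arch)
```

Exhaustive enumeration like this only works for tiny spaces, which is why the search phase relies on a search algorithm instead.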

In this example, we took MobileNet v2 as a base architecture and made a search space from MobileNet inverted bottleneck blocks.

First, let's define a model for an image classification task. MobileNet-like models can be built with the `build_mobilenet` function, which returns a model with the following structure: (inputs) -> stem -> blocks -> head -> (outputs). The stem and the head have a fixed structure, while the blocks consist of user-defined operations. The template class follows the MobileNet v2 structure, which can be found in the [torchvision library](https://github.com/pytorch/vision/blob/master/torchvision/models/mobilenetv2.py).

In [None]:
from enot.models import SearchSpaceModel
from enot.models.mobilenet import MobileNetBaseHead
from enot.models.mobilenet import MobileNetBaseStem
from enot.models.operations import SearchableFuseableSkipConv
from enot.models.operations import SearchableMobileInvertedBottleneck
from enot.models.operations import SearchVariantsContainer
import torch.nn as nn

# defining your model class
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.stem = MobileNetBaseStem(in_channels=3)
        self.body = nn.ModuleList(
            [
                # 1 searchable block with 4 search options
                self.build_search_variants_1(16, 24),
                # 2 fixed blocks
                self.build_mib_k3_e6(24, 32, 2),
                self.build_mib_k3_e6(32, 32, 1),
                # 1 searchable block with 4 search options
                self.build_search_variants_1(32, 64),
                # 1 fixed block
                self.build_mib_k3_e6(64, 96, 2),
                # 1 searchable block with 4 search options
                self.build_search_variants_1(96, 160),
                # 1 searchable block with 4 search options
                self.build_search_variants_1(160, 320),
            ]
        )
        self.dropout = self.build_dropout_variants()
        self.head = MobileNetBaseHead(
            bottleneck_channels=320,
            last_channels=1280,
            num_classes=10,
            dropout_rate=0.0,
        )
        head_list = list(self.head.head)
        head_list.insert(3, self.dropout)
        self.head.head = torch.nn.Sequential(*head_list)

    @staticmethod
    def build_dropout_variants():
        return SearchVariantsContainer(
            [
                torch.nn.Dropout(p=0.0),
                torch.nn.Dropout(p=0.25),
                torch.nn.Dropout(p=0.5),
                torch.nn.Dropout(p=0.75),
            ]
        )

    @staticmethod
    def build_search_variants_1(in_channels, out_channels):
        return SearchVariantsContainer(
            [
                SearchableMobileInvertedBottleneck(
                    in_channels=in_channels,
                    out_channels=out_channels,
                    kernel_size=3,
                    stride=2,
                    expand_ratio=6,
                    padding=4,
                    use_skip_connection=True,
                    activation='relu',
                ),
                SearchableMobileInvertedBottleneck(
                    in_channels=in_channels,
                    out_channels=out_channels,
                    kernel_size=5,
                    stride=2,
                    expand_ratio=6,
                    padding=10,
                    use_skip_connection=False,
                    activation='relu6',
                ),
                SearchableMobileInvertedBottleneck(
                    in_channels=in_channels,
                    out_channels=out_channels,
                    kernel_size=7,
                    stride=1,
                    expand_ratio=3,
                    padding=12,
                    use_skip_connection=True,
                    activation='silu',
                ),
                SearchableFuseableSkipConv(
                    in_channels=in_channels,
                    out_channels=out_channels,
                    stride=2,
                ),
            ]
        )

    @staticmethod
    def build_mib_k3_e6(in_channels, out_channels, stride):
        return SearchableMobileInvertedBottleneck(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=3,
            stride=stride,
            expand_ratio=6,
        )

    def forward(self, x):
        x = self.stem(x)

        for block in self.body:
            x = block(x)

        x = self.head(x)

        return x

In [None]:
model = MyModel()

# move model to search space
search_space = SearchSpaceModel(model).cuda()

## Pretrain constructed search space

The "pretrain" phase is the first phase of the NAS procedure. In this phase, we train the user-defined search space with different architecture combinations so that their task performance can be compared in the search phase.

We offer the `PretrainOptimizer` class (a wrapper for a regular PyTorch optimizer), which does all the necessary "pretrain" magic. You should use `search_space` as the model and `PretrainOptimizer` as the optimizer in your train loop.

Follow these steps to turn your train loop into an enot "pretrain" loop:
1. Pass `search_space.model_parameters()` to the optimizer instead of `search_space.parameters()`.
1. Use `PretrainOptimizer` as the model optimizer.
1. Wrap the model step in a closure and pass the closure to the `pretrain_optimizer.step(...)` method. This is necessary because `PretrainOptimizer` does more than one step per batch. Alternatively, you can use `pretrain_optimizer.model_step(...)` in combination with `pretrain_optimizer.step()` to accumulate gradients over several batches before the actual optimizer step.
1. Call the `search_space.sample_random_arch()` method on each validation step. This method samples a single architecture from the search space. This is necessary if you want to measure the expected performance of the search space rather than the score of an individual random model.
1. By default, `PretrainOptimizer` requires one batch of train data to initialize optimizations, so you should run `search_space.initialize_output_distribution_optimization(...)` before the first model step. You can disable optimization checking in the `PretrainOptimizer` constructor (`check_recommended_optimizations` parameter), but this is not recommended.
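
The closure requirement can be illustrated with a toy stand-in. The class below is not `PretrainOptimizer`; it is a hypothetical object that only shows why an optimizer that evaluates the model several times per batch needs a callable rather than a precomputed loss. The number of evaluations (2) is an arbitrary assumption:

```python
class ToyMultiEvalOptimizer:
    """Toy stand-in that evaluates the closure several times per `step` call."""

    def __init__(self, n_evaluations):
        self.n_evaluations = n_evaluations

    def step(self, closure):
        losses = []
        for _ in range(self.n_evaluations):
            # A real optimizer could switch the sampled architecture here and then
            # re-run forward and backward through the closure.
            losses.append(closure())
        return sum(losses) / len(losses)


calls = {'n': 0}


def closure():
    # Stands in for: forward pass, loss computation, loss.backward().
    calls['n'] += 1
    return 1.0 / calls['n']


toy_optimizer = ToyMultiEvalOptimizer(n_evaluations=2)
mean_loss = toy_optimizer.step(closure)
print(calls['n'], mean_loss)
```

Since the closure runs more than once per batch, any statistics accumulated inside it (loss, accuracy, counters) are also updated more than once per batch.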

**IMPORTANT:**
We set `N_EPOCHS = 3` in this example to make the tutorial run faster. This is not enough for good pretrain quality; set `N_EPOCHS = 300` if you want to reproduce our results.

In [None]:
N_EPOCHS = 3
N_WARMUP_EPOCHS = 1

train_loader = dataloaders['pretrain_train_dataloader']
validation_loader = dataloaders['pretrain_validation_dataloader']

# using `search_space.model_parameters()` as optimizable variables
optimizer = SGD(params=search_space.model_parameters(), lr=0.06, momentum=0.9, weight_decay=1e-4)
# using `PretrainOptimizer` as a default optimizer
pretrain_optimizer = PretrainOptimizer(search_space=search_space, optimizer=optimizer)

len_train_loader = len(train_loader)
scheduler = CosineAnnealingLR(optimizer, T_max=len_train_loader * N_EPOCHS, eta_min=1e-8)
scheduler = WarmupScheduler(scheduler, warmup_steps=len_train_loader * N_WARMUP_EPOCHS)

metric_function = accuracy
loss_function = nn.CrossEntropyLoss().cuda()

for epoch in range(N_EPOCHS):

    print(f'EPOCH #{epoch}')

    search_space.train()
    train_metrics_acc = {
        'loss': 0.0,
        'accuracy': 0.0,
        'n': 0,
    }
    for inputs, labels in train_loader:
        # By default, `PretrainOptimizer` requires one batch of train data to initialize optimizations,
        # so you should run `search_space.initialize_output_distribution_optimization(...)` before the first
        # model step. You can disable optimization checking in `PretrainOptimizer` constructor
        # (`check_recommended_optimizations` parameter), but this is not recommended.
        if not search_space.output_distribution_optimization_enabled:
            search_space.initialize_output_distribution_optimization(inputs)

        pretrain_optimizer.zero_grad()
        # Wrapping model step and backward with closure.
        # Alternatively, here is `pretrain_optimizer.model_step(...)` example usage for gradient accumulation:
        #
        # pretrain_optimizer.zero_grad()
        # for n, (inputs, labels) in enumerate(train_loader):
        #
        #     def closure():
        #         ...
        #
        #     pretrain_optimizer.model_step(closure)
        #     if (n + 1) % 10 == 0:
        #         pretrain_optimizer.step()
        #         pretrain_optimizer.zero_grad()
        def closure():
            pred_labels = search_space(inputs)
            batch_loss = loss_function(pred_labels, labels)
            batch_loss.backward()
            batch_metric = metric_function(pred_labels, labels)

            train_metrics_acc['loss'] += batch_loss.item()
            train_metrics_acc['accuracy'] += batch_metric.item()
            train_metrics_acc['n'] += 1

        pretrain_optimizer.step(closure)
        if scheduler is not None:
            scheduler.step()

    train_loss = train_metrics_acc['loss'] / train_metrics_acc['n']
    train_accuracy = train_metrics_acc['accuracy'] / train_metrics_acc['n']

    print('train metrics:')
    print('  loss:', train_loss)
    print('  accuracy:', train_accuracy)

    # We are validating on one specific fixed architecture to avoid the effect of fluctuations
    # in the values of the loss function and metrics.
    arch_to_test = [0] * len(search_space.search_variants_containers)
    test_model = search_space.get_network_by_indexes(arch_to_test)
    test_model.eval()

    validation_loss = 0
    validation_accuracy = 0
    with torch.no_grad():
        for inputs, labels in validation_loader:
            pred_labels = test_model(inputs)
            batch_loss = loss_function(pred_labels, labels)
            batch_metric = metric_function(pred_labels, labels)

            validation_loss += batch_loss.item()
            validation_accuracy += batch_metric.item()

    n = len(validation_loader)
    validation_loss /= n
    validation_accuracy /= n

    print('validation metrics:')
    print('  loss:', validation_loss)
    print('  accuracy:', validation_accuracy)

    print()
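
The learning rate schedule used above combines warmup with cosine annealing. The sketch below computes such a schedule in plain Python, assuming a linear warmup followed by the standard cosine annealing formula. The exact behaviour of `WarmupScheduler` from `tutorial_utils` may differ, and `lr_at_step`, `steps_per_epoch` and the chosen `t_max` are illustrative assumptions:

```python
import math


def lr_at_step(step, base_lr, warmup_steps, t_max, eta_min=0.0):
    """Learning rate under linear warmup followed by cosine annealing."""
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    cosine_step = min(step - warmup_steps, t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * cosine_step / t_max))


steps_per_epoch = 100  # hypothetical len(train_loader)
schedule = [
    lr_at_step(s, base_lr=0.06, warmup_steps=steps_per_epoch, t_max=2 * steps_per_epoch, eta_min=1e-8)
    for s in range(3 * steps_per_epoch)
]
print(schedule[0], max(schedule), schedule[-1])
```

Printing or plotting the schedule before a long run is a cheap way to check that the warmup length and the annealing horizon match the number of training steps.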

## Search the best architecture

In [None]:
# Function that evaluates the metric to optimize; it can be a quality metric or a loss.
def _evaluate(model, metric_function, dataloader):
    metric = 0
    for inputs, labels in dataloader:
        pred_labels = model(inputs)
        batch_metric = metric_function(pred_labels, labels)
        metric += batch_metric.item()

    # Average over the batches of the dataloader that was actually iterated.
    return metric / len(dataloader)


def evaluate_accuracy(model):
    return _evaluate(
        model=model,
        metric_function=accuracy,
        dataloader=dataloaders['search_train_dataloader'],
    )


def evaluate_loss(model):
    return -_evaluate(
        model=model,
        metric_function=loss_function,
        dataloader=dataloaders['search_train_dataloader'],
    )

In [None]:
from fvcore.nn.flop_count import FlopCountAnalysis

latency_benchmark = TorchBenchmarkRunner(
    benchmark_iterations=1000,
    num_threads_per_process=2,
    benchmark_max_time_s=3,
)

data_sample = next(iter(dataloaders['search_train_dataloader']))[0].cuda()
search_space = search_space.cuda()


def measure_model_latency(model):
    return latency_benchmark(model, data_sample)[0]


def measure_model_throughput(model):
    return -latency_benchmark(model, data_sample)[-1]


def measure_model_mmac(model):
    # or search_space.forward_latency
    fca = FlopCountAnalysis(model=model, inputs=data_sample)
    fca.uncalled_modules_warnings(False)
    fca.unsupported_ops_warnings(False)
    return fca.total() / 1e6
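
As a sanity check for MMAC numbers, the multiply-accumulate count of a single convolution can be computed by hand. The helper `conv2d_macs` below is an illustrative assumption, not part of `fvcore` or enot:

```python
def conv2d_macs(in_channels, out_channels, kernel_size, out_height, out_width, groups=1):
    """Multiply-accumulate operations of a 2D convolution (bias ignored)."""
    macs_per_output = (in_channels // groups) * kernel_size * kernel_size
    return macs_per_output * out_channels * out_height * out_width


# Standard MobileNet v2 stem: 3x3 convolution, 3 -> 32 channels, stride 2 on a 224x224 input.
stem_macs = conv2d_macs(in_channels=3, out_channels=32, kernel_size=3, out_height=112, out_width=112)
print(stem_macs / 1e6)  # about 10.84 MMAC
```

Summing such per-layer counts over a model gives a hardware-independent complexity measure, which is why it can be used in place of measured latency.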

### Find architecture with best accuracy without any latency constraints

In [None]:
op_searcher = EvolutionSearch(
    search_space=search_space,
    evaluate_function=evaluate_accuracy,
    search_steps=20,
)

In [None]:
best_arch, best_acc = op_searcher.find_best_arch(return_metrics=True)

In [None]:
best_arch, best_acc

### Find architecture with maximal accuracy and defined mmac

In [None]:
search_space.sample(
    [
        [0],
    ]
    * len(search_space.search_variants_containers)
)
max_mmac = measure_model_mmac(search_space)

search_space.sample(
    [
        [3],
    ]
    * len(search_space.search_variants_containers)
)
min_mmac = measure_model_mmac(search_space)
print(max_mmac, min_mmac)

In [None]:
op_searcher = EvolutionSearch(
    search_space=search_space,
    evaluate_function=evaluate_accuracy,
    calculate_latency_function=measure_model_mmac,
    target_latency=1050,
    search_steps=20,
)

In [None]:
best_arch, best_acc, best_lat = op_searcher.find_best_arch(return_metrics=True)

In [None]:
best_arch, best_acc, best_lat
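
To illustrate the idea of maximizing a score under a latency budget, here is a toy mutation-based search in plain Python. It is far simpler than `EvolutionSearch` and is not its algorithm; the cost and gain tables, the function names and the assumption that the target acts as an upper bound are all hypothetical:

```python
import random

N_CONTAINERS = 5
N_OPTIONS = 4
# Hypothetical per-option costs and quality gains; index 0 is the heaviest operation.
OPTION_COST = [4.0, 3.0, 2.0, 1.0]
OPTION_GAIN = [1.0, 0.9, 0.7, 0.3]


def toy_latency(arch):
    return sum(OPTION_COST[i] for i in arch)


def toy_score(arch):
    return sum(OPTION_GAIN[i] for i in arch)


def toy_mutation_search(target_latency, search_steps=20, population_size=16, seed=0):
    rng = random.Random(seed)
    # Start from the lightest architecture, which is assumed to satisfy the constraint.
    best = (N_OPTIONS - 1,) * N_CONTAINERS
    for _ in range(search_steps):
        for _ in range(population_size):
            # Mutate one randomly chosen block of the current best architecture.
            candidate = list(best)
            candidate[rng.randrange(N_CONTAINERS)] = rng.randrange(N_OPTIONS)
            candidate = tuple(candidate)
            feasible = toy_latency(candidate) <= target_latency
            if feasible and toy_score(candidate) > toy_score(best):
                best = candidate
    return best, toy_score(best), toy_latency(best)


toy_arch, toy_best_score, toy_best_latency = toy_mutation_search(target_latency=12.0)
print(toy_arch, toy_best_score, toy_best_latency)
```

The same pattern covers other objectives: a quantity that should be minimized (such as loss) or maximized under a lower bound (such as throughput) can be negated so that it fits the "maximize score, stay below target" form.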

### Find architecture with maximal accuracy and defined latency

In [None]:
ss_n_cont = len(search_space.search_variants_containers)
search_space.eval()
with torch.no_grad():
    search_space.sample(
        [
            [0],
        ]
        * ss_n_cont
    )
    print(measure_model_latency(search_space))

    search_space.sample(
        [
            [3],
        ]
        * ss_n_cont
    )
    print(measure_model_latency(search_space))

In [None]:
op_searcher = EvolutionSearch(
    search_space=search_space,
    evaluate_function=evaluate_accuracy,
    target_latency=0.0023,
    search_steps=20,
    calculate_latency_function=measure_model_latency,
)

In [None]:
best_arch, best_acc, best_lat = op_searcher.find_best_arch(return_metrics=True)

In [None]:
best_arch, best_acc, best_lat

### Find architecture with minimal loss and defined throughput

In [None]:
ss_n_cont = len(search_space.search_variants_containers)
search_space.eval()
with torch.no_grad():
    search_space.sample(
        [
            [0],
        ]
        * ss_n_cont
    )
    print(measure_model_throughput(search_space))

    search_space.sample(
        [
            [3],
        ]
        * ss_n_cont
    )
    print(measure_model_throughput(search_space))

In [None]:
op_searcher = EvolutionSearch(
    search_space=search_space,
    evaluate_function=evaluate_loss,
    target_latency=-350,
    search_steps=20,
    calculate_latency_function=measure_model_throughput,
)

In [None]:
best_arch, best_acc, best_lat = op_searcher.find_best_arch(return_metrics=True)

In [None]:
best_arch, best_acc, best_lat

## Tune model with the best architecture
Now we take the best architecture from the search space and create a regular model from it. Then we run the finetune procedure (a usual training loop).

In [None]:
# Get regular model with the best architecture.
best_model = search_space.get_network_by_indexes(best_arch).cuda()
print(best_arch)

In [None]:
N_EPOCHS = 10

optimizer = SGD(best_model.parameters(), lr=2e-4, momentum=0.9, weight_decay=1e-4)
scheduler = None
metric_function = accuracy
loss_function = nn.CrossEntropyLoss().cuda()

train_loader = dataloaders['tune_train_dataloader']
validation_loader = dataloaders['tune_validation_dataloader']

for epoch in range(N_EPOCHS):

    print(f'EPOCH #{epoch}')

    best_model.train()
    train_loss = 0.0
    train_accuracy = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()

        pred_labels = best_model(inputs)
        batch_loss = loss_function(pred_labels, labels)
        batch_loss.backward()
        batch_metric = metric_function(pred_labels, labels)

        train_loss += batch_loss.item()
        train_accuracy += batch_metric.item()

        optimizer.step()
        if scheduler is not None:
            scheduler.step()

    n = len(train_loader)
    train_loss /= n
    train_accuracy /= n

    print('train metrics:')
    print('  loss:', train_loss)
    print('  accuracy:', train_accuracy)

    best_model.eval()
    validation_loss = 0
    validation_accuracy = 0
    with torch.no_grad():
        for inputs, labels in validation_loader:
            pred_labels = best_model(inputs)
            batch_loss = loss_function(pred_labels, labels)
            batch_metric = metric_function(pred_labels, labels)

            validation_loss += batch_loss.item()
            validation_accuracy += batch_metric.item()

    n = len(validation_loader)
    validation_loss /= n
    validation_accuracy /= n

    print('validation metrics:')
    print('  loss:', validation_loss)
    print('  accuracy:', validation_accuracy)

    print()

## Make model portable
The model will be free from our framework dependencies, so you can save it and load it with plain PyTorch.

In [None]:
from enot.utils.common import make_portable

# Make sampled model free from our package
portable_model = make_portable(best_model).cuda()

# You can save it and load without our package just with pytorch
model_path = PROJECT_DIR / 'best_model.pth'
torch.save(portable_model, model_path)

In [None]:
# Restart the kernel and run the next cells with only torch-dependent packages
import torch
import torch.nn as nn
from pathlib import Path

from tutorial_utils.dataset import create_imagenette_dataloaders
from tutorial_utils.train import accuracy


HOME_DIR = Path.home() / '.optimization_experiments'
DATASETS_DIR = HOME_DIR / 'datasets'
PROJECT_DIR = HOME_DIR / 'getting_started'
model_path = PROJECT_DIR / 'best_model.pth'

metric_function = accuracy
loss_function = nn.CrossEntropyLoss().cuda()

dataloaders = create_imagenette_dataloaders(
    dataset_root_dir=DATASETS_DIR,
    project_dir=PROJECT_DIR,
    input_size=(224, 224),
    batch_size=32,
)
validation_loader = dataloaders['tune_validation_dataloader']

# load saved portable model
best_model_enot_free = torch.load(model_path)

In [None]:
# Best model works without our package!
best_model_enot_free.eval()
validation_loss = 0
validation_accuracy = 0
with torch.no_grad():
    for inputs, labels in validation_loader:
        pred_labels = best_model_enot_free(inputs)
        batch_loss = loss_function(pred_labels, labels)
        batch_metric = metric_function(pred_labels, labels)

        validation_loss += batch_loss.item()
        validation_accuracy += batch_metric.item()

n = len(validation_loader)
validation_loss /= n
validation_accuracy /= n

print('validation metrics:')
print('  loss:', validation_loss)
print('  accuracy:', validation_accuracy)

print()