## Getting started with ENOT

This notebook describes the basic steps you need to optimize an architecture using the ENOT framework.

### Main chapters of this notebook:
1. Set up the environment
1. Prepare dataset and create dataloaders
1. Create model and move it into search space
1. Pretrain constructed search space
1. Search best architecture
1. Tune model with the best architecture

## Set up the environment
First, let's set up the environment and make some common imports.

In [None]:
import os

os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
# You may need to change this variable to match free GPU index
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [None]:
from pathlib import Path

import numpy as np
import torch
import torch.nn as nn

from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch_optimizer import RAdam

from enot.models import SearchSpaceModel
from enot.models.mobilenet import build_mobilenet
from enot.optimize import EnotPretrainOptimizer
from enot.optimize import EnotSearchOptimizer
from enot.latency import initialize_latency
from enot.latency import min_latency
from enot.latency import max_latency
from enot.latency import mean_latency
from enot.latency import best_arch_latency

from tutorial_utils.train_utils import accuracy
from tutorial_utils.train_utils import WarmupScheduler

from tutorial_utils.checkpoints import download_getting_started_pretrain_checkpoint
from tutorial_utils.dataset import create_imagenette_dataloaders

### In the following cell we set up all necessary directories

* `ENOT_HOME_DIR` - ENOT framework home directory
* `ENOT_DATASETS_DIR` - root directory for datasets (imagenette2, ...)
* `PROJECT_DIR` - project directory to save training logs, checkpoints, ...

In [None]:
ENOT_HOME_DIR = Path.home() / '.enot'
ENOT_DATASETS_DIR = ENOT_HOME_DIR / 'datasets'
PROJECT_DIR = ENOT_HOME_DIR / 'getting_started'

ENOT_HOME_DIR.mkdir(exist_ok=True)
ENOT_DATASETS_DIR.mkdir(exist_ok=True)
PROJECT_DIR.mkdir(exist_ok=True)

## Prepare dataset and create dataloaders

We will use the Imagenette2 dataset in this example. <br>
The `create_imagenette_dataloaders` function prepares the data for you; specifically, it:
1. downloads and unpacks the dataset into `ENOT_DATASETS_DIR`;
1. splits the dataset into 4 parts and saves the annotations of every part in `PROJECT_DIR`;
1. creates dataloaders for every stage and returns them as a dictionary.

The four parts of the dataset:
* train: for pretrain and tuning stages (`PROJECT_DIR`/train.csv)
* validation: to choose checkpoint for architecture optimization (`PROJECT_DIR`/validation.csv)
* search: for architecture search stage (`PROJECT_DIR`/search.csv)
* test: hold-out data for testing (`PROJECT_DIR`/test.csv)

Available dataloaders in the dictionary:
* pretrain_train_dataloader:
    training dataloader for "pretrain" stage (using data from `PROJECT_DIR`/train.csv)

* pretrain_validation_dataloader:
    validation dataloader for "pretrain" stage (using data from `PROJECT_DIR`/validation.csv)

* search_train_dataloader:
    training dataloader for "search" stage (using data from `PROJECT_DIR`/search.csv)

* search_validation_dataloader:
    validation dataloader for "search" stage (using data from `PROJECT_DIR`/search.csv)

* tune_train_dataloader:
    training dataloader for "finetune" stage (same as pretrain_train_dataloader)

* tune_validation_dataloader:
    validation dataloader for "finetune" stage (same as pretrain_validation_dataloader)

**NOTE:**<br>
CSV annotations use the following format:
```
filepath,label
<relative_path_1>,<int_label_1>
<relative_path_2>,<int_label_2>
...
```
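
As an illustration of the annotation format above, here is a minimal sketch that reads such a file with Python's standard `csv` module. The helper name `read_annotations` and the sample paths are hypothetical and are not part of `tutorial_utils`.

```python
import csv
import io

# Hypothetical sample in the annotation format described above;
# the paths and labels are made up for illustration.
SAMPLE_CSV = """filepath,label
train/class_a/img_0001.JPEG,0
train/class_b/img_0002.JPEG,1
"""


def read_annotations(stream):
    """Return a list of (relative_path, int_label) pairs from a CSV stream."""
    reader = csv.DictReader(stream)
    return [(row['filepath'], int(row['label'])) for row in reader]


annotations = read_annotations(io.StringIO(SAMPLE_CSV))
for path, label in annotations:
    print(path, label)
```

To read a real file, pass an open file object instead of the `io.StringIO` wrapper, for example `open(PROJECT_DIR / 'train.csv', newline='')`.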

In [None]:
dataloaders = create_imagenette_dataloaders(
    dataset_root_dir=ENOT_DATASETS_DIR, 
    project_dir=PROJECT_DIR,
    input_size=(224, 224),
    batch_size=32,
)

## Create model and move it into search space

Our architecture optimization procedure selects the best combination of operations from a user-defined search space. The easiest way to define one is to take a base architecture and add similar operations with different parameters: kernel size, expansion ratio, activation, etc. You can also add different kinds of operations or implement your own (see "Tutorial - adding a custom operation").

In this example, we took MobileNet v2 as a base architecture and made a search space from MobileNet inverted bottleneck blocks.

First, let's define a model for an image classification task. MobileNet-like models can be built with the `build_mobilenet` function, which returns a model with the following structure: (inputs) -> stem -> blocks -> head -> (outputs). The stem and head have a fixed structure, while the blocks consist of user-defined operations. The template follows the MobileNet v2 structure, which can be found in the [torchvision library](https://github.com/pytorch/vision/blob/master/torchvision/models/mobilenetv2.py).

In [None]:
# The search space offers these ops as choices in each layer.
# Short format for operations is 'Name_param1=value1_param2=value2...'.
# MIB is a MNv2 inverted bottleneck, k is a kernel size for depthwise
# convolution, and t is an expansion ratio coefficient.
# See more in-depth info in "Tutorial - adding custom operations".
SEARCH_OPS = [
    'MIB_k=3_t=6',
    'MIB_k=5_t=6',
    'MIB_k=7_t=6',
]

# build model
model = build_mobilenet(
    search_ops=SEARCH_OPS,
    num_classes=10,
    blocks_out_channels=[24, 32, 64, 96, 160, 320],
    blocks_count=[2, 2, 2, 1, 2, 1],
    blocks_stride=[2, 2, 2, 1, 2, 1],
)

# move model to search space
search_space = SearchSpaceModel(model).cuda()
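
To make the short operation format and the size of the search space more concrete, here is a small sketch. `parse_op` is a hypothetical helper (not an ENOT function) and assumes integer parameter values. The architecture count assumes that every block listed in `blocks_count` is an independent choice among all entries of `SEARCH_OPS`, which this notebook does not state explicitly.

```python
def parse_op(short_name):
    """Split 'Name_param1=value1_...' into (name, {param: int_value})."""
    name, *param_parts = short_name.split('_')
    params = {}
    for part in param_parts:
        key, value = part.split('=')
        params[key] = int(value)  # assumption: all values are integers
    return name, params


search_ops = ['MIB_k=3_t=6', 'MIB_k=5_t=6', 'MIB_k=7_t=6']
blocks_count = [2, 2, 2, 1, 2, 1]

for op in search_ops:
    print(parse_op(op))

# Assumption: each block is an independent choice among all search ops.
n_layers = sum(blocks_count)
n_architectures = len(search_ops) ** n_layers
print(n_layers, n_architectures)
```

Under these assumptions, even three options per layer give tens of thousands of candidate architectures, which is why they are trained jointly in a search space rather than one by one.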

## Pretrain constructed search space

The "pretrain" phase is the first phase of the NAS procedure. In this phase, we train the user-defined search space with different architecture combinations so that their task performance can be compared in the search phase.

We offer the `EnotPretrainOptimizer` class (a wrapper for a regular PyTorch optimizer), which handles all the necessary "pretrain" logic. You should use `search_space` as the model and `EnotPretrainOptimizer` as the optimizer in your train loop.

Follow these steps to turn your train loop into an ENOT "pretrain" loop:
1. Pass `search_space.model_parameters()` to the optimizer instead of `search_space.parameters()`.
1. Use `EnotPretrainOptimizer` as the model optimizer.
1. Wrap the model step in a closure and pass the closure to the `enot_optimizer.step(...)` method. This is necessary because `EnotPretrainOptimizer` does more than one step per batch. Alternatively, you can use `enot_optimizer.model_step(...)` in combination with `enot_optimizer.step()` to accumulate gradients several times before the actual optimizer step.
1. Call the `search_space.sample_random_arch()` method on each validation step. This method samples a single architecture from the search space. This is necessary if you want to measure the expected performance of the search space rather than the score of an individual random model.
1. By default, `EnotPretrainOptimizer` requires one batch of train data to initialize optimizations, so you should run `search_space.initialize_output_distribution_optimization(...)` before the first model step. You can disable optimization checking in the `EnotPretrainOptimizer` constructor (`check_recommended_optimizations` parameter), but this is not recommended.
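
To show why the model step has to be wrapped in a closure, here is a toy sketch in plain Python. It does not reproduce ENOT internals: `ToySearchSpace` and `ToyMultiPassOptimizer` are hypothetical stand-ins that only illustrate the pattern of an optimizer re-running forward and backward several times per batch, each time with a freshly sampled architecture.

```python
import random


class ToySearchSpace:
    """Holds one scalar weight per candidate operation (toy stand-in)."""

    def __init__(self, n_ops):
        self.weights = [0.0] * n_ops
        self.grads = [0.0] * n_ops
        self.active_op = 0

    def sample_random_arch(self, rng):
        self.active_op = rng.randrange(len(self.weights))


class ToyMultiPassOptimizer:
    """Calls the closure several times per step, resampling the architecture."""

    def __init__(self, search_space, lr, n_passes, seed=0):
        self.search_space = search_space
        self.lr = lr
        self.n_passes = n_passes
        self.rng = random.Random(seed)

    def zero_grad(self):
        space = self.search_space
        space.grads = [0.0] * len(space.grads)

    def step(self, closure):
        space = self.search_space
        for _ in range(self.n_passes):
            space.sample_random_arch(self.rng)
            closure()  # forward + backward for the sampled architecture
        # Apply the gradients accumulated over all passes.
        for i, grad in enumerate(space.grads):
            space.weights[i] -= self.lr * grad / self.n_passes


space = ToySearchSpace(n_ops=3)
optimizer = ToyMultiPassOptimizer(space, lr=0.1, n_passes=4)
target = 1.0
calls = []


def closure():
    # Toy "forward": the active op's weight is the prediction.
    i = space.active_op
    error = space.weights[i] - target
    # Toy "backward": gradient of 0.5 * error ** 2 w.r.t. the weight.
    space.grads[i] += error
    calls.append(i)
    return 0.5 * error ** 2


optimizer.zero_grad()
optimizer.step(closure)
print(len(calls))  # one closure call per pass
```

Because the optimizer decides how often to call the closure, the training loop itself stays the same regardless of how many passes are made per batch.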

**IMPORTANT:**
We set `N_EPOCHS = 3` in this example to make the tutorial run faster. This is not enough for good pretrain quality, and you should set `N_EPOCHS = 300` if you want to reproduce our results.

In [None]:
N_EPOCHS = 3
N_WARMUP_EPOCHS = 1

train_loader = dataloaders['pretrain_train_dataloader']
validation_loader = dataloaders['pretrain_validation_dataloader']

# using `search_space.model_parameters()` as optimizable variables
optimizer = SGD(params=search_space.model_parameters(), lr=0.06, momentum=0.9, weight_decay=1e-4)
# using `EnotPretrainOptimizer` as a default optimizer
enot_optimizer = EnotPretrainOptimizer(search_space=search_space, optimizer=optimizer)

len_train_loader = len(train_loader)
scheduler = CosineAnnealingLR(optimizer, T_max=len_train_loader*N_EPOCHS, eta_min=1e-8)
scheduler = WarmupScheduler(scheduler, warmup_steps=len_train_loader*N_WARMUP_EPOCHS)

metric_function = accuracy
loss_function = nn.CrossEntropyLoss().cuda()

for epoch in range(N_EPOCHS):

    print(f'EPOCH #{epoch}')

    search_space.train()
    train_metrics_acc = {
        'loss': 0.0,
        'accuracy': 0.0,
        'n': 0,
    }
    for inputs, labels in train_loader:
        # By default, `EnotPretrainOptimizer` requires one batch of train data to initialize optimizations, 
        # so you should run `search_space.initialize_output_distribution_optimization(...)` before the first 
        # model step. You can disable optimization checking in `EnotPretrainOptimizer` constructor 
        # (`check_recommended_optimizations` parameter), but this is not recommended.
        if not search_space.output_distribution_optimization_enabled:
            search_space.initialize_output_distribution_optimization(inputs)

        enot_optimizer.zero_grad()
        # Wrapping model step and backward with closure.
        # Alternatively, here is `enot_optimizer.model_step(...)` example usage for gradient accumulation:
        #
        # enot_optimizer.zero_grad()
        # for n, (inputs, labels) in enumerate(train_loader):
        #
        #     def closure():
        #         ...
        #
        #     enot_optimizer.model_step(closure)
        #     if (n + 1) % 10 == 0:
        #         enot_optimizer.step()
        #         enot_optimizer.zero_grad()
        def closure():
            pred_labels = search_space(inputs)
            batch_loss = loss_function(pred_labels, labels)
            batch_loss.backward()
            batch_metric = metric_function(pred_labels, labels)

            train_metrics_acc['loss'] += batch_loss.item()
            train_metrics_acc['accuracy'] += batch_metric.item()
            train_metrics_acc['n'] += 1

        enot_optimizer.step(closure)
        if scheduler is not None:
            scheduler.step()

    train_loss = train_metrics_acc['loss'] / train_metrics_acc['n']
    train_accuracy = train_metrics_acc['accuracy'] / train_metrics_acc['n']

    print('train metrics:')
    print('  loss:', train_loss)
    print('  accuracy:', train_accuracy)

    search_space.eval()
    validation_loss = 0
    validation_accuracy = 0
    for inputs, labels in validation_loader:

        # Sample random architecture from the search space to estimate
        # search space expected metrics.
        search_space.sample_random_arch()

        pred_labels = search_space(inputs)
        batch_loss = loss_function(pred_labels, labels)
        batch_metric = metric_function(pred_labels, labels)

        validation_loss += batch_loss.item()
        validation_accuracy += batch_metric.item()

    n = len(validation_loader)
    validation_loss /= n
    validation_accuracy /= n

    print('validation metrics:')
    print('  loss:', validation_loss)
    print('  accuracy:', validation_accuracy)

    print()
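
The learning-rate schedule in the cell above combines cosine annealing with a warmup phase. The sketch below shows the shape of such a schedule in plain Python. The cosine part uses the closed form of cosine annealing; the warmup part is an assumption (a linear factor that scales the annealed value), and the actual `WarmupScheduler` in `tutorial_utils` may behave differently. The function name `lr_at_step` and the loader length are hypothetical.

```python
import math


def lr_at_step(step, base_lr, total_steps, warmup_steps, eta_min=1e-8):
    """Cosine-annealed learning rate scaled by an assumed linear warmup factor."""
    # Closed form of cosine annealing from base_lr down to eta_min.
    cosine = (1 + math.cos(math.pi * step / total_steps)) / 2
    annealed = eta_min + (base_lr - eta_min) * cosine
    # Assumption: warmup linearly scales the annealed value up to 1.0.
    if warmup_steps > 0:
        warmup_factor = min(1.0, (step + 1) / warmup_steps)
    else:
        warmup_factor = 1.0
    return annealed * warmup_factor


steps_per_epoch = 100  # hypothetical loader length
total_steps = 3 * steps_per_epoch   # N_EPOCHS = 3
warmup_steps = 1 * steps_per_epoch  # N_WARMUP_EPOCHS = 1

for step in (0, 50, 99, 100, 200, 299):
    print(step, lr_at_step(step, 0.06, total_steps, warmup_steps))
```

With this shape, the learning rate ramps up during the first epoch and then decays smoothly towards `eta_min` by the last step.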

In [None]:
# We pretrained search space for 3 epochs in this example. In this cell, we are downloading
# search space checkpoint, pretrained for 300 epochs (for demonstration purposes).
checkpoint_path = PROJECT_DIR / 'getting_started_pretrain_checkpoint.pth'
download_getting_started_pretrain_checkpoint(checkpoint_path)

search_space.load_state_dict(
    torch.load(checkpoint_path)['model'],
)

## Search best architecture
Now that you have a trained search space, you can run the "search" phase. The setup is similar to pretrain.

Follow these steps to turn your train loop into an ENOT "search" loop:
1. Pass `search_space.architecture_parameters()` to the optimizer instead of `search_space.parameters()`.
1. Use `EnotSearchOptimizer` as the model optimizer.
1. To initialize the latency of all operations, use the `initialize_latency` function from `enot.latency`. Currently, we support the `mmac` latency type and two third-party calculators for this latency type: `mmac.thop` and `mmac.pthflops`.
1. Wrap the model step in a closure and pass the closure to the `enot_optimizer.step(...)` method. This is necessary because `EnotSearchOptimizer` does more than one step per batch. Alternatively, you can use `enot_optimizer.model_step(...)` in combination with `enot_optimizer.step()` to accumulate gradients several times before the actual optimizer step.
1. To take latency into account during the search, add the latency loss to your main loss.
1. Run validation on the best architecture. The best architecture is the one constructed from the operations with the highest probabilities.
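
The last two steps can be illustrated with a toy calculation in plain Python. This is only a sketch of the general idea: it assumes that per-layer operation probabilities come from a softmax over architecture parameters and that the latency term is the probability-weighted sum of operation latencies. How ENOT computes `loss_latency_expectation` internally is not shown in this notebook, and all function names and numbers below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_latency(arch_logits, op_latencies, constant_latency=0.0):
    """Sum over layers of the probability-weighted operation latencies."""
    total = constant_latency
    for logits, latencies in zip(arch_logits, op_latencies):
        probs = softmax(logits)
        total += sum(p * lat for p, lat in zip(probs, latencies))
    return total


def best_arch(arch_logits):
    """Pick the highest-probability operation index in every layer."""
    return [max(range(len(logits)), key=logits.__getitem__) for logits in arch_logits]


# Hypothetical numbers: two layers, three candidate ops each (latency in MMACs).
arch_logits = [[0.0, 0.0, 0.0], [2.0, 0.0, -1.0]]
op_latencies = [[10.0, 20.0, 30.0], [5.0, 8.0, 12.0]]

task_loss = 1.25  # made-up cross-entropy value
latency_loss_weight = 2e-3
total_loss = task_loss + latency_loss_weight * expected_latency(arch_logits, op_latencies)

print(best_arch(arch_logits))
print(total_loss)
```

A larger `latency_loss_weight` pushes the probabilities towards cheaper operations, while a weight of zero optimizes for the task loss alone.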

In [None]:
N_EPOCHS = 10

latency_loss_weight = 2e-3

# using `search_space.architecture_parameters()` as optimizable variables
optimizer = RAdam(search_space.architecture_parameters(), lr=0.01)
# using `EnotSearchOptimizer` as a default optimizer
enot_optimizer = EnotSearchOptimizer(search_space, optimizer)

metric_function = accuracy
loss_function = nn.CrossEntropyLoss().cuda()

train_loader = dataloaders['search_train_dataloader']
validation_loader = dataloaders['search_validation_dataloader']

latency_type = 'mmac' # or mmac.thop, mmac.pthflops

for epoch in range(N_EPOCHS):

    print(f'EPOCH #{epoch}')

    search_space.train()
    train_metrics_acc = {
        'loss': 0.0,
        'accuracy': 0.0,
        'n': 0,
    }
    for inputs, labels in train_loader:
        
        # Initialize latency.
        if latency_type and search_space.latency_type is None:
            latency_container = initialize_latency(latency_type, search_space, (inputs, ))
            print(f'Constant latency = {latency_container.constant_latency}')
            print(f'Min, mean and max latencies of search space: '
                  f'{min_latency(latency_container)}, '
                  f'{mean_latency(latency_container)}, '
                  f'{max_latency(latency_container)}')

        enot_optimizer.zero_grad()

        # Wrapping model step and backward with closure.
        def closure():
            pred_labels = search_space(inputs)
            batch_loss = loss_function(pred_labels, labels)

            # adding latency loss to main loss
            if latency_loss_weight is not None and latency_loss_weight != 0:
                batch_loss += search_space.loss_latency_expectation * latency_loss_weight

            batch_loss.backward()
            batch_metric = metric_function(pred_labels, labels)

            train_metrics_acc['loss'] += batch_loss.item()
            train_metrics_acc['accuracy'] += batch_metric.item()
            train_metrics_acc['n'] += 1

        enot_optimizer.step(closure)

    train_loss = train_metrics_acc['loss'] / train_metrics_acc['n']
    train_accuracy = train_metrics_acc['accuracy'] / train_metrics_acc['n']
    arch_probabilities = np.array(search_space.architecture_probabilities)

    print('train metrics:')
    print('  loss:', train_loss)
    print('  accuracy:', train_accuracy)
    print('  arch_probabilities:')
    print(arch_probabilities)

    search_space.eval()

    # selecting best architecture for validation
    search_space.sample_best_arch()

    validation_loss = 0
    validation_accuracy = 0
    for inputs, labels in validation_loader:
        pred_labels = search_space(inputs)
        batch_loss = loss_function(pred_labels, labels)
        batch_metric = metric_function(pred_labels, labels)

        validation_loss += batch_loss.item()
        validation_accuracy += batch_metric.item()

    n = len(validation_loader)
    validation_loss /= n
    validation_accuracy /= n

    print('validation metrics:')
    print('  loss:', validation_loss)
    print('  accuracy:', validation_accuracy)
    if search_space.latency_type is not None:
        # getting latency of the best architecture
        latency = best_arch_latency(search_space)
        print('  latency:', latency)

    print()

## Tune model with the best architecture
Now we take the best architecture from the search space and create a regular model from it. Then we run the finetune procedure (a usual training loop).

In [None]:
# get regular model with the best architecture
best_model = search_space.get_network_with_best_arch().cuda()

In [None]:
N_EPOCHS = 10

optimizer = SGD(best_model.parameters(), lr=2e-4)
scheduler = None
metric_function = accuracy
loss_function = nn.CrossEntropyLoss().cuda()

train_loader = dataloaders['tune_train_dataloader']
validation_loader = dataloaders['tune_validation_dataloader']

for epoch in range(N_EPOCHS):

    print(f'EPOCH #{epoch}')

    best_model.train()
    train_loss = 0.0
    train_accuracy = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
            
        pred_labels = best_model(inputs)
        batch_loss = loss_function(pred_labels, labels)
        batch_loss.backward()
        batch_metric = metric_function(pred_labels, labels)
            
        train_loss += batch_loss.item()
        train_accuracy += batch_metric.item()

        optimizer.step()
        if scheduler is not None:
            scheduler.step()

    n = len(train_loader)
    train_loss /= n
    train_accuracy /= n

    print('train metrics:')
    print('  loss:', train_loss)
    print('  accuracy:', train_accuracy)
    
    best_model.eval()    
    validation_loss = 0
    validation_accuracy = 0
    for inputs, labels in validation_loader:
        pred_labels = best_model(inputs)
        batch_loss = loss_function(pred_labels, labels)
        batch_metric = metric_function(pred_labels, labels)

        validation_loss += batch_loss.item()
        validation_accuracy += batch_metric.item()
    
    n = len(validation_loader)
    validation_loss /= n
    validation_accuracy /= n
    
    print('validation metrics:')
    print('  loss:', validation_loss)
    print('  accuracy:', validation_accuracy)
    
    print()