# Introduction

Thanks for taking part in our competition and downloading the starting kit! We've compiled this readme to show
you how to run all the scripts that we'll use to evaluate your submissions, as well as give you a quick tour of what
we've included. Please feel free to reach out to use for help or clarifications, using the contact details on the
competition page. Additionally, check out the CodaLab wiki here: https://github.com/codalab/codalab-competitions/wiki

# About this Notebook
This notebook walks through writing a sample submission, and then takes that submission through our evaluation pipeline. Feel free to replace this sample submission with your own, so that you can check to make sure your submission works.

# Define our NAS class
Every submission must contain a `NAS` class. This class must contain a method `search` that receives input data and returns a valid PyTorch model for that data. We've written a very simple sample, one that modifies a ResNet18 to compatible with the dataset and task. This class is all you need
to provide for your submission, so don't worry about writing things data loading functions, all of the will be automatically handled once you submit. This sample submission is the "benchmark" submission; every submission will scored based on how it compares to this one. 

In [1]:
import torch.nn as nn
import torchvision

class NAS:
    def __init__(self):
        pass

    # given some input data, return the "best possible" architecture
    def search(self, train_x, train_y, valid_x, valid_y, metadata):
        n_classes = metadata['n_classes']

        # load resnet18 model (definitely not the best possible architecture, but it'll work as an example!)
        model = torchvision.models.resnet50()

        # reshape it to this dataset
        model.conv1 = nn.Conv2d(train_x.shape[1], 64, kernel_size=(7, 7), stride=1, padding=3)
        model.fc = nn.Linear(model.fc.in_features, n_classes, bias=True)
        return model

To submit, all you need to do is zip together a `nas.py` file that contains your NAS class, any helper scripts you might need, and the `metadata` file. Make sure to set the `full_training` flag accordingly in the metadata file, depending on whether you want to run the full training pipeline on our servers or just a quick debug pipeline.

# Load Data
When you submit, data loading will be handled for you, all you need to supply is the `NAS` class. However, for debugging it can be helpful to load the data manually, so here's how to do it.

## Data Format
When we load a dataset, we'll load five numpy arrays and the dataset metadata file.
    
The five arrays are as follows:
    * train_x: numpy array of shape (n_datapoints, channels, weight, height). This is our input training data to the model. Each dataset will comprise of images. All images within a dataset are of identical channel size and spatial size, however these will vary *between* datasets. These datapoints
    are pre-normalized and shuffled; there should be no need to perform data augmentation over them. These datapoints are exactly identical to the ones that will used to train the models found by your algorithm.
	* train_y: numpy array of shape (n_datapoints). These are our training labels. It's an array of integers, such that train_y[i] corresponds to the label of train_x[i]. 
    * valid_x: numpy array of shape (n_datapoints, channels, weight, height). Our input validation data. It'll be of the exact same shape of the training input data.
    * valid_y: numpy array of shape (n_datapoints). Our validation labels, again an array of integers.
    * test_x: numpy array of shape (n_datapoints, channels, weight, height). Our input test data. It'll be of the exact same shape of the training input data.
    
    
The metadata is a dictionary that contains the following keys:
    * batch_size: the batch size that will be used to train this da==taset
    * n_classes: the total number of classes in the classification task
    * lr: the learning rate that will be used to train this dataset
    * benchmark: the threshold used to determine zero point of scoring; your score on the dataset will equal '10 * (test_acc - benchmark) / (100-test_acc)'. 
        - This means you can score a maximum of 10 points on each dataset: a full 10 points will be awarded for 100% test accuracy, while 0 points will be awarded for a test accuracy equal to the benchmark. 
    * name: a unique name for this dataset

In [9]:
# load the exact data loaders that we'll use to load the data
from ingestion_program.nascomp.helpers import *

# load the exact retraining script we'll use to evaluate the found models
from ingestion_program.nascomp.torch_evaluator import *

# if you want to use the real development data, download the public data and set data_dir appropriately
data_dir = 'sample_data'


# find all the datasets in the given directory:
dataset_paths = get_dataset_paths(data_dir)
dataset_predictions = []
for path in dataset_paths:
    (train_x, train_y), (valid_x, valid_y), (test_x), metadata = load_datasets(path)
    print("=== {} {}".format(metadata['name'],"="*50))
    print("Train X shape:",train_x.shape)
    print("Train Y shape:",train_y.shape)
    print("Valid X shape:",valid_x.shape)
    print("Valid Y shape:",valid_y.shape)
    print("Test X shape:", test_x.shape)
    print("Metadata:", metadata)
    

    # initialize our NAS class
    nas = NAS()
    
    # search for a model
    model = nas.search(train_x, train_y, valid_x, valid_y, metadata)
    
    # package data for the evaluator
    data = (train_x, train_y), (valid_x, valid_y), test_x
    
    # retrain the model from scratch
    results = torch_evaluator(model, data, metadata, n_epochs=7, full_train=True)
    
    # clean up the NAS class
    del nas
    
    # save our predictions
    dataset_predictions.append(results['test_predictions'])
    print()

Train X shape: (100, 3, 28, 28)
Train Y shape: (100,)
Valid X shape: (100, 3, 28, 28)
Valid Y shape: (100,)
Test X shape: (100, 3, 28, 28)
Metadata: {'batch_size': 64, 'n_classes': 20, 'lr': 0.01, 'benchmark': 92.08, 'name': 'sample_dataset_0'}
===== EVALUATING sample_dataset_0 =====
Cuda available? True
=== EPOCH 0 ===
  Train Acc:     6.000%, Val Acc:    7.000%, Mem Alloc:  722.00MiB, T Remaining Est: 1.37s
  Train Loss:    3.260 , Val Loss:    3.001
  Current best score:    Val Acc:     7.000% @ epoch 0
=== EPOCH 1 ===
  Train Acc:     4.000%, Val Acc:    4.000%, Mem Alloc:  722.00MiB, T Remaining Est: 1.13s
  Train Loss:    3.231 , Val Loss:    3.009
  Current best score:    Val Acc:     7.000% @ epoch 0
=== EPOCH 2 ===
  Train Acc:     9.000%, Val Acc:    3.000%, Mem Alloc:  722.00MiB, T Remaining Est: 0.95s
  Train Loss:    3.629 , Val Loss:    3.446
  Current best score:    Val Acc:     7.000% @ epoch 0
=== EPOCH 3 ===
  Train Acc:    17.000%, Val Acc:    5.000%, Mem Alloc:  722

# Score the Predictions
Again this will be all handled for you upon submission, but here's a copy of the scoring script so you can test things locally. We first load the labels for each test dataset, and then compare the accuracy of your model's predictions against these test labels. This score is then
adjusted according to the score benchmark. The scores will be pretty terrible over the sample data; download the public data and use that to get an accurate picture of performance over the Development Phase data.

In [10]:
overall_score = 0
out = []
for i, path in enumerate(dataset_paths):

    # load the reference values
    ref_y = np.load(os.path.join(path, 'test_y.npy'))

    # load the dataset_metadata for this dataset
    metadata =  load_dataset_metadata(path)
    
    print("=== Scoring {} ===".format(metadata['name']))
    index = metadata['name'][-1]

    # load the model predictions
    pred_y = dataset_predictions[i]

    # compute accuracy
    score = sum(ref_y == pred_y)/float(len(ref_y)) * 100
    print("  Raw score:", score)
    print("  Benchmark:", metadata['benchmark'])

    # adjust score according to benchmark
    point_weighting = 10/(100 - metadata['benchmark'])
    score -= metadata['benchmark']
    score *= point_weighting
    print("  Adjusted:  ", score)

    # add per-dataset score to overall
    overall_score += score

    # add to scoring stringg
    out.append("Dataset_{}_Score: {:.3f}".format(index, score))
out.append("Overall_Score: {:.3f}".format(overall_score))

# print score
print(out)

=== Scoring sample_dataset_0 ===
  Raw score: 4.0
  Benchmark: 92.08
  Adjusted:   -111.21212121212119
=== Scoring sample_dataset_1 ===
  Raw score: 7.000000000000001
  Benchmark: 92.87
  Adjusted:   -120.43478260869573
=== Scoring sample_dataset_2 ===
  Raw score: 9.0
  Benchmark: 87.0
  Adjusted:   -60.0
['Dataset_0_Score: -111.212', 'Dataset_1_Score: -120.435', 'Dataset_2_Score: -60.000', 'Overall_Score: -291.647']
