<hr style="border-width:2px;border-color:#84C7F7">
<center><h1> Cross-Domain Meta-learning Competition </h1></center>
<center><h2>  Any-way Any-shot learning </h2></center>
<hr style="border-width:2px;border-color:#84C7F7">

Make sure you have installed all the dependencies in your kernel environment. If you ran the <code>quick_start.sh</code> script, make sure you activated the **cdml** conda environment before launching the Jupyter notebook. The link to the CodaLab competition, where you can submit your code and check the leaderboard, will be available here.


**Outline**: 
- [**I - Data exploration**](#0): We define the any-way any-shot learning setup and explore how the data is formatted.
- [**II - Submission details**](#1): We present how a submission should be organized.
- [**III - Test and submission**](#2): We present how to test a potential submission and also how to zip your scripts to submit your code on CodaLab. 

<a name='0'></a>
# I - Data exploration

The goal of this section is to familiarize participants with the data format used in the challenge.

Cross-domain meta-learning procedures aim to produce a Learner that can quickly adapt to new tasks from unseen domains using only a few examples. In the standard machine learning setting, we usually split the data into train and test sets, which contain **examples** assumed to be generated from the same distribution. Few-shot learning follows the same idea with one additional level of abstraction: we have a meta-train and a meta-test split (and optionally a meta-validation split). In **single-domain few-shot learning**, the meta-train and meta-test datasets are assumed to have classes generated from **the same task distribution**. In the **cross-domain scenario**, however, the meta-datasets are generated from **different task distributions**. To simulate the cross-domain scenario, we provide 10 public datasets from different domains that you can use to test your algorithms locally.

During the challenge, the meta-train, meta-validation and meta-test sets are generated on the fly.
* **Meta-training**: with data sampled from the meta-train pool, you can meta-train a MetaLearner, i.e. try to learn the best approach to tackle different tasks.
* **Meta-validation**: with data sampled from the meta-validation pool, you can adjust the meta-learner's hyper-parameters without worrying about data leakage.
* **Meta-testing**: with data sampled from the meta-test pool, we evaluate how well the Learner produced by the meta-learning procedure adapts to new unseen tasks. To measure this, we define what we call **episodes**: small tasks that contain only a few examples of unseen classes.
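
As a rough illustration of why the pools do not leak into each other, the sketch below splits a dataset's classes (not its images) into disjoint meta-train and meta-validation pools. The helper `split_class_pool`, the toy class names and the shuffling strategy are assumptions made for this example; the challenge code may build its pools differently.

```python
import random
from typing import List, Tuple


def split_class_pool(classes: List[str],
                     train_pool_size: float = 0.75,
                     seed: int = 0) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: split class names into disjoint train/valid pools."""
    rng = random.Random(seed)
    shuffled = classes[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = int(round(train_pool_size * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


classes = [f"class_{i}" for i in range(20)]   # toy class names
train_pool, valid_pool = split_class_pool(classes)
print(len(train_pool), len(valid_pool))
```

Because the split is done at the class level, no class seen during meta-training can reappear in a meta-validation task.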

Let's formalize some of the ideas exposed above.

## Definitions

In this challenge there are 2 different ways to generate data during meta-training. We can either generate data in the form of **episodes** or **batches**. Let's first describe these 2 methods: 

An **episode**, which represents a **task**, is described as follows: 
$$ \mathcal{T} = \{ \mathcal{D}_{train}, \mathcal{D}_{test}\}$$
where $\mathcal{D}_{train} = \{x_{i}, y_{i}\}_{i \in \mathcal{I}_{train}}$ is the training set of the task, often called the **support set**, and $\mathcal{D}_{test} = \{x_{i}, y_{i}\}_{i \in \mathcal{I}_{test}}$ is the test set of the task, often called the **query set**. Here $\mathcal{I}_{train}$ and $\mathcal{I}_{test}$ are the indices of the train and test set examples respectively. Note that the data contained in one episode belong strictly to one of the public datasets, while different episodes may come from different datasets.

A **batch** is a collection of examples sampled from the meta-train pool **without enforcing a configuration**. You can specify the batch size, which is the number of examples to sample from the pool. Examples are sampled directly from the pool, without first sampling **classes** as is the case for episodes. More importantly, unlike in the episodic setting, there is no $\mathcal{D}_{test}$. Note that the data contained in one batch may be sampled from several of the public datasets.
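
The two definitions can be contrasted with a small, self-contained sketch. It uses a toy pool of string identifiers instead of images, and the helpers `make_toy_pool`, `sample_episode` and `sample_batch` are hypothetical names introduced here; they are not part of the challenge code. For simplicity the toy pool represents a single dataset.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (example id, class name)


def make_toy_pool(n_classes: int = 6, n_per_class: int = 30) -> Dict[str, List[str]]:
    """Toy pool: class name -> list of example ids (stand-ins for images)."""
    return {f"c{c}": [f"c{c}_img{i}" for i in range(n_per_class)]
            for c in range(n_classes)}


def sample_episode(pool: Dict[str, List[str]], n_ways: int, k_shots: int,
                   query_size: int, rng: random.Random
                   ) -> Tuple[List[Example], List[Example]]:
    """Sample classes first, then split each class into support and query."""
    support: List[Example] = []
    query: List[Example] = []
    for cls in rng.sample(sorted(pool), n_ways):
        picked = rng.sample(pool[cls], k_shots + query_size)
        support += [(x, cls) for x in picked[:k_shots]]
        query += [(x, cls) for x in picked[k_shots:]]
    return support, query


def sample_batch(pool: Dict[str, List[str]], batch_size: int,
                 rng: random.Random) -> List[Example]:
    """Sample examples directly: no class configuration and no query set."""
    flat = [(x, cls) for cls, xs in pool.items() for x in xs]
    return rng.sample(flat, batch_size)


rng = random.Random(0)
pool = make_toy_pool()
support, query = sample_episode(pool, n_ways=3, k_shots=2, query_size=4, rng=rng)
batch = sample_batch(pool, batch_size=8, rng=rng)
print(len(support), len(query), len(batch))
```

The episode always contains exactly `n_ways` classes with disjoint support and query examples, whereas the batch can contain any mix of classes.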

The figure below illustrates the difference between the **episodic** setting and  the **batch** setting.

<img src="train_settings.png" alt="Train settings" width="700">



## The any-way any-shot learning problem

Few-shot learning problems are often referred to as N-way K-shot problems. This name refers to the episode configuration at **meta-test time**. The number of **ways** N denotes the number of classes in an episode that represents an image classification problem. The number of **shots** K denotes the number of examples per class in the **support set**. In our case, we focus on the **any-way any-shot** setting. In other words, episodes at meta-test time represent image classification problems with a number of classes varying from 2 to 20, and the **support set** contains 1 to 20 labelled examples per class. Thus, at meta-test time your algorithm may be tested in the following way:
- **Test episode 1:** 5-way 1-shot task.
- **Test episode 2:** 3-way 15-shot task.
- **Test episode 3:** 12-way 4-shot task.
- $\vdots$

Let's summarize the different parts of the meta-learning procedure.

* At **meta-train** time: This is the part you have control over. You can choose to generate data from the meta-train split in the form of **episodes** or **batches**. If you choose the episodic setting, the generated tasks at meta-train time are N-way any-shot tasks for which you can specify the number of classes and the bounds for the number of examples per class. If you want to generate N-way K-shot tasks, you must specify K as both the lower and the upper bound for the number of examples per class.
* At **meta-validation** and **meta-test** time: We always evaluate your learning algorithm using the same setting: we generate new unseen tasks from the corresponding pool in the form of episodes. These episodes have a fixed configuration, the any-way any-shot setting. This means that when you receive a new unseen task, the support set (i.e. train set) is composed of 1 to 20 examples per class for 2 to 20 classes. The query set (i.e. test set) is composed of all the available examples of the corresponding classes that are not used in the support set.
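
To make the ranges concrete, here is a minimal sketch that draws task configurations within the stated bounds. The function `sample_task_config` is a hypothetical helper, and drawing N and K uniformly is an assumption; the text above only fixes the ranges, not the sampling distribution.

```python
import random
from typing import Tuple


def sample_task_config(rng: random.Random,
                       ways: Tuple[int, int] = (2, 20),
                       shots: Tuple[int, int] = (1, 20)) -> Tuple[int, int]:
    """Draw (N, K) within inclusive bounds (uniform sampling is an assumption)."""
    n_ways = rng.randint(*ways)
    k_shots = rng.randint(*shots)
    return n_ways, k_shots


rng = random.Random(42)
for episode_id in range(1, 4):
    n, k = sample_task_config(rng)
    print(f"Test episode {episode_id}: {n}-way {k}-shot, support size {n * k}")

# Fixing K: use the same value for both bounds, as with min_s == max_s.
n, k = sample_task_config(rng, ways=(5, 5), shots=(3, 3))
print(f"Fixed configuration: {n}-way {k}-shot")
```

The last call mirrors the rule for meta-training: setting the lower and upper bounds to the same value yields a classic N-way K-shot task.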

As mentioned previously, in this challenge the episodes are generated **on the fly** from our datasets. It is also worth mentioning that the episodes and batches come from **generators**, meaning that they are virtually infinite.
 
**Note**: Make sure you have downloaded the public datasets under the <code>public_data/</code> directory in the root directory of this project, i.e. **../../cd-metadl**. As soon as the competition starts, the public datasets will be directly downloaded using the `quick_start.sh` script. 

Let's see what it looks like in practice.

In [None]:
from cdmetadl.ingestion_program.data_generator import TrainGenerator

data_dir = "../../public_data" # Path to Public data

# To initialize the TrainGenerator you can define the following arguments:
# data_format: Format for the training data, it can be 'episode' or 'batch'.
# train_pool_size: Percentage of the available classes that should be used to 
#                  generate the training examples. The remaining percentage 
#                  will be kept for validation.
# num_ways: Number of classes for the generated tasks. Only used when 
#           data_format is 'episode'.
# min_s: Minimum number of shots for the generated tasks. 
# max_s: Maximum number of shots for the generated tasks.
# fixed_query_size: Flag to control the size of the query set. If true, query 
#                   size must be specified, else, all the available information
#                   not used for the support set will be used as query set.
# query_size: Number of images for the query set.
train_generator = TrainGenerator(data_dir, 
                                 data_format = "episode",
                                 train_pool_size = 0.75,
                                 num_ways = 5,
                                 min_s = 5,
                                 max_s = 10,
                                 fixed_query_size = True,
                                 query_size = 10)

# The initialized TrainGenerator creates 2 generators as attributes:
# Meta-train data generator: meta_train_generator
# Meta-valid data generator: meta_valid_generator
meta_train_generator = train_generator.meta_train_generator
meta_valid_generator = train_generator.meta_valid_generator

In the previous cell, we created a <code>TrainGenerator</code> object. You receive data during meta-training through this object. Notice that you can specify the configuration of meta-train and meta-valid episodes, but you could switch to <code>data_format="batch"</code> if you think it would improve your meta-algorithm performance. We are going to visualize data generated as **episodes** and **batches** in the next code cells. 

In [None]:
# Helpers to visualize the episodes and batches.

import os
from collections import Counter

import numpy as np
import matplotlib.pyplot as plt


def plot_episode(support_images: np.ndarray, 
                 support_labels: np.ndarray, 
                 query_images: np.ndarray,
                 query_labels: np.ndarray, 
                 size_multiplier: float = 2, 
                 max_imgs_per_col: int = 10,
                 max_imgs_per_row: int = 10) -> None:
    """ Plots the content of an episode. Episodes are composed of a support set 
    (training set) and a query set (test set). 
    
    Args:
        support_images (np.ndarray): Images in the support set, they have a 
            shape of (batch_size_support x height x width x channels).
        support_labels (np.ndarray): Labels in the support set, they have a 
            shape of (batch_size_support, ). 
        query_images (np.ndarray): Images in the query set, they have a 
            shape of (batch_size_query x height x width x channels).
        query_labels (np.ndarray): Labels in the query set, they have a 
            shape of (batch_size_query, ). 
        size_multiplier (float, optional): Dilate or shrink the size of 
            displayed images. Defaults to 2.
        max_imgs_per_col (int, optional): Number of images in a column. 
            Defaults to 10.
        max_imgs_per_row (int, optional): Number of images in a row. Defaults 
            to 10.
    """
    
    for name, images, class_ids in zip(("Support", "Query"),
                                     (support_images, query_images),
                                     (support_labels, query_labels)):
        n_samples_per_class = Counter(class_ids)
        n_samples_per_class = {k: min(v, max_imgs_per_col) 
            for k, v in n_samples_per_class.items()}
        id_plot_index_map = {k: i for i, k
            in enumerate(n_samples_per_class.keys())}
        num_classes = min(max_imgs_per_row, len(n_samples_per_class.keys()))
        max_n_sample = max(n_samples_per_class.values())
        figwidth = max_n_sample
        figheight = num_classes
        figsize = (figheight * size_multiplier, figwidth * size_multiplier)
        fig, axarr = plt.subplots(figwidth, figheight, figsize=figsize)
        fig.suptitle(f"{name} Set", size='15')
        fig.tight_layout(pad=3, w_pad=0.1, h_pad=0.1)
        reverse_id_map = {v: k for k, v in id_plot_index_map.items()}
        for i, ax in enumerate(axarr.flat):
            ax.patch.set_alpha(0)
            # Print the class ids. This is needed since we want to set the x 
            # axis label even if there is no picture.
            ax.set(xlabel=reverse_id_map[i % figheight], xticks=[], yticks=[])
            ax.label_outer()
        for image, class_id in zip(images, class_ids):
            # First decrement by one to find last spot for the class id.
            n_samples_per_class[class_id] -= 1
            # If class column is filled or not represented: pass.
            if (n_samples_per_class[class_id] < 0 or
                id_plot_index_map[class_id] >= max_imgs_per_row):
                continue
            # If width or height is 1, then axarr is a vector.
            if axarr.ndim == 1:
                ax = axarr[n_samples_per_class[class_id] 
                    if figheight == 1 else id_plot_index_map[class_id]]
            else:
                ax = axarr[n_samples_per_class[class_id], 
                    id_plot_index_map[class_id]]
            ax.imshow(image)
        plt.show()

        
def plot_batch(images: np.ndarray, 
               labels: np.ndarray, 
               size_multiplier: int = 1) -> None:
    """ Plot the images in a batch.

    Args:
        images (np.ndarray): Images inside the batch, they have a shape of 
            (batch_size x height x width x channels).
        labels (np.ndarray): Labels inside the batch, they have a shape of
            (batch_size, )
        size_multiplier (int, optional): Dilate or shrink the size of 
            displayed images. Defaults to 1.
    """
    num_examples = len(labels)
    figwidth = np.ceil(np.sqrt(num_examples)).astype('int32')
    figheight = num_examples // figwidth
    figsize = (figwidth * size_multiplier, (figheight + 2.5) * size_multiplier)
    _, axarr = plt.subplots(figwidth, figheight, dpi=150, figsize=figsize)

    for i, ax in enumerate(axarr.transpose().ravel()):
        ax.imshow(images[i])
        ax.set(xlabel=str(labels[i]), xticks=[], yticks=[])
    
    plt.show()

In [None]:
N_EPISODES = 2

for i, episode in enumerate(meta_train_generator(N_EPISODES)):
    print(f"Episode id: {i+1} from source {data_dir}")
    print(f"# Ways: {episode.num_ways}")
    print(f"# Shots: {episode.num_shots}")
    print()
    plot_episode(support_images=episode.support_set[0], 
                 support_labels=episode.support_set[1],
                 query_images=episode.query_set[0], 
                 query_labels=episode.query_set[1])

In the figures above, you can observe the composition of an episode: a **support set** (train) and a **query set** (test). In the next cell, we present some useful characteristics of an episode.

In [None]:
print("The episode object is organized the following way:\n" + 
      "Episode e:\n" +
      "   - e.num_ways: int\n" +
      "   - e.num_shots: int\n" +
      "   - e.support_set: Tuple(images, labels)\n" +
      "   - e.query_set: Tuple(images, labels)")
print(f"\n{'#'*70}\n")
print("The support set images are of the following shape: "
    + f"{episode.support_set[0].shape}")
print(f"The support set labels are: {np.unique(episode.support_set[1])} and "
    + f"their shape: {episode.support_set[1].shape}")
print(f"\n{'#'*70}\n")
print("The query set images are of the following shape: "
    + f"{episode.query_set[0].shape}")
print(f"The query set labels shape is: {episode.query_set[1].shape} \n")

Now let's take a look at the **batch** mode. 

Let's assume we'd like to receive data from the meta-train split in batches of 20 images. 

In [None]:
# The initialized TrainGenerator creates 2 generators as attributes:
# Meta-train data generator: meta_train_generator
# Meta-valid data generator: meta_valid_generator
batch_data_generator = TrainGenerator(data_dir, 
                                      data_format = "batch",
                                      train_pool_size = 0.75)

meta_train_generator = batch_data_generator.meta_train_generator
meta_valid_generator = batch_data_generator.meta_valid_generator

NUM_BATCHES = 1
BATCH_SIZE = 20
(images, labels) = next(meta_train_generator(NUM_BATCHES, BATCH_SIZE))
print(f"Batch images shape: {images.shape}")
print(f"Batch labels shape: {labels.shape}")

plot_batch(images, labels)

For the challenge, you don't need to create your own generators: you will receive already-initialized meta-train and meta-valid generators. How you receive them is described in the next section. The default setting is episodic: 5-way any-shot tasks (between 1 and 20 shots) with query sets of 20 images and a train pool of 75% of the available classes per dataset. However, if you think you could achieve better performance with your own meta-training setting, you can specify it. To do so, write your settings in a YAML file named **config.yaml** and put it in your submission folder before zipping it. We will go over the structure of the submission folder in the next sections. Here is an example of a config file for the Prototypical Networks algorithm: 

**Content of a <code>config.yaml</code> file**:
```yaml
data_format: episode 
num_ways: 60
min_s: 1
max_s: 20
fixed_query_size: False
```
Notice that the configurations that can be included in the file are exactly the arguments previously defined when we initialized the TrainGenerator. For clarity the available configurations are:

- `data_format`: Format for the training data, it can be 'episode' or 'batch'.
- `train_pool_size`: Percentage of the available classes that should be used to generate the training examples. The remaining percentage will be kept for validation.
- `num_ways`: Number of classes for the generated tasks. Only used when data_format is 'episode'.
- `min_s`: Minimum number of shots for the generated tasks. 
- `max_s`: Maximum number of shots for the generated tasks.
- `fixed_query_size`: Flag to control the size of the query set. If true, query size must be specified, else, all the available information not used for the support set will be used as query set.
- `query_size`: Number of images for the query set.
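
Before zipping a submission, it can help to sanity-check the settings. The sketch below works on a plain Python dict that mirrors the keys listed above. The helper `check_config` and its rules are assumptions made for illustration. In particular, it treats a missing `query_size` as a problem whenever `fixed_query_size` is true, which may be stricter than the challenge code.

```python
from typing import Any, Dict, List

ALLOWED_KEYS = {"data_format", "train_pool_size", "num_ways", "min_s",
                "max_s", "fixed_query_size", "query_size"}


def check_config(config: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: return a list of problems found in a config dict."""
    problems = []
    for key in sorted(set(config) - ALLOWED_KEYS):
        problems.append(f"unknown key: {key}")
    if config.get("data_format", "episode") not in ("episode", "batch"):
        problems.append("data_format must be 'episode' or 'batch'")
    if ("min_s" in config and "max_s" in config
            and config["min_s"] > config["max_s"]):
        problems.append("min_s must not exceed max_s")
    if config.get("fixed_query_size") and "query_size" not in config:
        problems.append("fixed_query_size is true but query_size is missing")
    return problems


# Same values as the example config.yaml, written as a Python dict.
proto_config = {"data_format": "episode", "num_ways": 60, "min_s": 1,
                "max_s": 20, "fixed_query_size": False}
print(check_config(proto_config))
print(check_config({"fixed_query_size": True}))
```

An empty list means that none of these checks found a problem. It does not guarantee that the competition server accepts the file.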

---

**Section summary** :

* You can choose to generate data from the meta-train split in the form of episodes or batches. The default configuration is episodic, but you can change it via a **config.yaml** file that you put in your submission folder.
* You can choose to have access to episodes coming from the meta-validation split to match the evaluation at meta-test time. However, we do not allow you to generate data from the meta-validation split in batch mode.

<a name='1'></a>
# II - Submission details
In this section, we will review the structure of a valid submission. We will see that the data we receive for the learning algorithm follows the aforementioned structure.

The participants will have to submit a zip file containing one or several files. The crucial file to add is <code>model.py</code>. It contains the meta-learning algorithm logic. This file **has** to follow the specific API that we defined for the challenge described in the following figure: 

<img src="API.png" alt="Challenge API" width="500">

The 3 classes, with the methods that need to be overridden, are the following:
* **MetaLearner**: The meta-learner contains the meta-algorithm logic. The <code>meta_fit(meta_train_generator, meta_valid_generator)</code> method has to be overridden with your own meta-learning algorithm. It receives the data generators initialized with the default setting or with your **config.yaml** file.
* **Learner**: It encapsulates the logic to learn from a new unseen task. Several methods need to be overridden:
 * <code>fit(dataset_train)</code>: Takes a support (train) set as an argument and fits the learner according to this dataset.
 * <code>save(path)</code>: You need to implement a way to save your model in the specified directory. 
 * <code>load(path)</code>: You need to implement a way to load your model from the file you created in <code>save(path)</code>.
* **Predictor**: The predictor contains the logic of your model to make predictions once the learner is fitted. The <code>predict(dataset_test)</code> encapsulates this step and takes a query (test) set as an argument, i.e. unlabelled examples.
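
The way the three classes chain together can be sketched with toy stand-ins. The classes below do not inherit from the challenge API and learn nothing. The call order shown (first `meta_fit`, then `fit`, then `predict`) and the tuple layout passed to `fit` are assumptions based on the method descriptions above.

```python
import numpy as np


class ToyPredictor:
    def __init__(self, n_ways: int) -> None:
        self.n_ways = n_ways

    def predict(self, dataset_test: np.ndarray) -> np.ndarray:
        # Uniform probabilities: one row per query image, one column per class.
        return np.full((len(dataset_test), self.n_ways), 1.0 / self.n_ways)


class ToyLearner:
    def fit(self, dataset_train) -> ToyPredictor:
        _, _, n_ways, _ = dataset_train   # (X_train, y_train, n_ways, k_shots)
        return ToyPredictor(n_ways)


class ToyMetaLearner:
    def meta_fit(self, meta_train_generator, meta_valid_generator) -> ToyLearner:
        return ToyLearner()               # no meta-training in this toy


# Assumed call order; the generators are unused here, so None stands in for them.
learner = ToyMetaLearner().meta_fit(None, None)

n_ways, k_shots, query_size = 3, 2, 5
X_train = np.zeros((n_ways * k_shots, 128, 128, 3), dtype=np.float32)
y_train = np.repeat(np.arange(n_ways), k_shots)
X_test = np.zeros((query_size, 128, 128, 3), dtype=np.float32)

predictor = learner.fit((X_train, y_train, n_ways, k_shots))
probs = predictor.predict(X_test)
print(probs.shape)
```

Each stage returns the object used by the next one: the MetaLearner returns a Learner, the Learner returns a Predictor, and the Predictor returns class probabilities for the query images.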

## Walkthrough of a submission example

In this sub-section, we present what your code submission folder should look like before zipping it.

**Example of a submission directory**
```
proto
│   api.py      (Mandatory)
│   model.py    (Mandatory)
│   metadata    (Mandatory)
│   config.yaml (Optional but has to have this name)
│   helper.py   (Optional) 
│   utils.py    (Optional)
│   ...
```
<code>api.py</code>, <code>model.py</code> and <code>metadata</code> are the crucial files to be added. The first is the API that we provide, whose classes you have to override in <code>model.py</code> with your learning algorithm. The <code>metadata</code> file is only needed for the competition server to work properly; simply add it to your folder without worrying about it (you can find this file in any baseline's folder). Other files can be added, and it is up to you to organize your code as you like.

## Defining the classes
We go through a dummy example to understand how to create a model. In the code cell below, you can find the **random** baseline. There are 2 important remarks:
- First, it is mandatory to **write a file** in the <code>path</code> given as an argument in the <code>save(path)</code> method. It could be any file, such as some metadata that you gathered and/or your serialized neural network, but you need to include one.
- Then, one can notice that the shape of the array returned by the <code>predict</code> method depends on the query set of each task. In general, the shape must be (batch_size_query x num_classes). 
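
The shape requirement from the second remark can be checked locally with a sketch like the following. It converts raw scores into probabilities with a row-wise softmax and verifies the (batch_size_query x num_classes) shape. The helpers `to_probabilities` and `check_predictions` are hypothetical, and whether the scoring program requires rows that sum to 1 is an assumption.

```python
import numpy as np


def to_probabilities(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax: (query_size, num_classes) scores -> probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def check_predictions(probs: np.ndarray, query_size: int, num_classes: int) -> None:
    """Hypothetical sanity check for the array returned by predict."""
    assert probs.shape == (query_size, num_classes), probs.shape
    assert np.allclose(probs.sum(axis=1), 1.0)


rng = np.random.default_rng(0)
scores = rng.normal(size=(7, 4))      # 7 query images, 4 classes
probs = to_probabilities(scores)
check_predictions(probs, query_size=7, num_classes=4)
print(probs.argmax(axis=1))           # predicted class index per query image
```

Running such a check on a few episodes with different numbers of ways is a cheap way to catch a hard-coded number of classes.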

**Note**: You can always test your algorithm with <code>run.py</code> to verify everything is working properly. We explain how to run the script in the next section.

In [None]:
from cdmetadl.api.api import MetaLearner, Learner, Predictor

import os
import numpy as np
import pickle
from typing import Tuple


class MyMetaLearner(MetaLearner):

    def __init__(self, N_ways: int, total_train_classes: int) -> None:
        super().__init__(N_ways, total_train_classes)

    def meta_fit(self, meta_train_generator, meta_valid_generator) -> Learner:
        """ Uses the meta-dataset to fit the meta-learner's parameters. A 
        meta-dataset can be an epoch (list with batches of images) or a batch 
        of few-shot learning tasks.
        
        Args:
            meta_train_generator: Function that generates the training data.
                The generated data can be an episode (N-ways any-shot learning task) 
                or a batch of images with labels.
            meta_valid_generator: Function that generates the validation data.
                The generated data always come in form of any-ways any-shot 
                learning tasks.
                
        Returns:
            Learner: Resulting learner ready to be trained and evaluated on 
                new unseen tasks.
        """
        return MyLearner()


class MyLearner(Learner):

    def __init__(self):
        super().__init__()

    def fit(self, 
            dataset_train: Tuple[np.ndarray,np.ndarray,int,int]) -> Predictor:
        """ Fit the Learner to the support set of a new unseen task. 
        
        Args:
            dataset_train: Support set of a task. The data arrive in the 
                following format (X_train, y_train, n_ways, k_shots). X_train 
                is the array of labeled images of shape 
                (n_ways*k_shots x 128 x 128 x 3), y_train are the encoded
                labels (int) for each image in X_train, n_ways (int) are the 
                number of classes and k_shots (int) the number of examples per 
                class.
                        
        Returns:
            Predictor: The resulting predictor ready to predict unlabelled 
                query image examples from the new unseen task.
        """
        _, y_train, _, _ = dataset_train
        return MyPredictor(y_train)

    def save(self, path_to_save: str) -> None:
        """ Saves the learning object associated to the Learner. It could be 
        a neural network for example. 
        
        Args:
            path_to_save (str): Path where the Learner will be saved
        """
        
        if not os.path.isdir(path_to_save):
            raise ValueError(("The model directory provided is invalid. Please"
                + " check that its path is valid."))
        
        # Use a context manager so the file is flushed and closed.
        with open(f"{path_to_save}/learner.pickle", "wb") as f:
            pickle.dump(self, f)
 
    def load(self, path_to_model: str) -> Learner:
        """ Loads the learning object associated to the Learner. It should 
        match the way you saved this object in save().
        
        Args:
            path_to_model (str): Path where the Learner is saved
            
        Returns:
            Learner: Loaded learner
        """
        if not os.path.isdir(path_to_model):
            raise ValueError(("The model directory provided is invalid. Please"
                + " check that its path is valid."))
        
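        model_file = f"{path_to_model}/learner.pickle"
        # Fail loudly if the file is missing instead of hitting an unbound name.
        if not os.path.isfile(model_file):
            raise FileNotFoundError(f"No saved learner found at {model_file}.")
        with open(model_file, "rb") as f:
            saved_learner = pickle.load(f)
        return saved_learner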
        
    
class MyPredictor(Predictor):

    def __init__(self, labels):
        super().__init__()
        self.labels = np.unique(labels)

    def predict(self, dataset_test) -> np.ndarray:
        """ Given a dataset_test, predicts the probabilities associated to the 
        provided images.
        
        Args:
            dataset_test: Array of unlabelled image examples of shape 
                (query_size x 128 x 128 x 3).
        
        Returns:
            np.ndarray: Predicted probs for all images. The array must be of 
                shape (query_size, N_ways).
        """
        random_pred = np.random.choice(self.labels, len(dataset_test))
        random_probs = np.zeros((random_pred.size, len(self.labels)))
        random_probs[np.arange(random_pred.size), random_pred] = 1
        return random_probs

You can refer to the <code>cd-metadl/baselines/</code> folder if you want to see submission examples. Here are the algorithms provided: 
- The **random** baseline.  
- The **naïve transfer learning** baseline.
- The **Prototypical Networks** based on  [J. Snell et al. - Prototypical Networks for Few-shot Learning (2017)](https://arxiv.org/pdf/1703.05175).
- The **MAML** algorithm based on [C. Finn et al. - Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (2017)](https://arxiv.org/pdf/1703.03400).
- The Prototypical Networks with the **Feature-wise transformation layers** proposed by  [H-Y. Tseng et al. - Cross-Domain Few-Shot Classification via Learned Feature-Wise Transformation (2020)](https://arxiv.org/abs/2001.08735).

<a name='2'></a>
# III - Test and Submission

Here we present the <code>run.py</code> script. It is meant to mimic what happens on the CodaLab platform, i.e. the competition server. Let's say you have worked on an algorithm and want to test it before submitting it. The script creates your MetaLearner object, runs the <code>meta_fit</code> method and evaluates your meta-algorithm on test episodes generated from the meta-test split. You can run the script with the following arguments:
- <code>input_dir</code>: The path which contains the **public datasets**. 
- <code>submission_dir</code>: The path which contains your **algorithm's code** following the format we previously defined. 

In [None]:
!python -m cdmetadl.run --input_dir=../../public_data --submission_dir=../baselines/random

## Prepare a ZIP file ready for submission
Here we present how to zip your code to submit it on the CodaLab platform. As an example, we zip the folder <code>cd-metadl/baselines/random/</code> which corresponds to the random baseline which was introduced in the previous section.

In [None]:
from zip_utils import zipdir

model_dir = "../baselines/random/"
submission_filename = "mysubmission.zip"
zipdir(submission_filename, model_dir)
print(f"Submit this file: {submission_filename}")

## Summary 
For clarity, we summarize the steps that you should be aware of while making a submission:
- Follow the **MetaLearner**/**Learner**/**Predictor** API to encapsulate your few-shot learning algorithm. Please make sure you name your subclasses as **MyMetaLearner**, **MyLearner** and **MyPredictor** respectively.
- Make sure you <u>save</u> at least one file in the given <code>path</code>. If this is a trained neural network, you need to serialize it in the <code>save()</code> method, and provide code to deserialize it in the <code>load()</code> method. Examples are provided in <code>cd-metadl/baselines/</code>.
- In your algorithm folder, make sure you have <code>api.py</code>, <code>model.py</code> and <code>metadata</code> with these **exact** names. If you want to use your custom configuration for the training generator make sure to include the **config.yaml** file.
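
The checklist above can be partly automated before zipping. The sketch below is a hypothetical helper, not part of the challenge tooling. It only checks that the mandatory files exist and that <code>model.py</code> mentions the three expected class names.

```python
from pathlib import Path
from typing import List

MANDATORY_FILES = ("api.py", "model.py", "metadata")
REQUIRED_CLASSES = ("MyMetaLearner", "MyLearner", "MyPredictor")


def check_submission_dir(submission_dir: str) -> List[str]:
    """Hypothetical helper: list what is missing from a submission folder."""
    root = Path(submission_dir)
    problems = [f"missing file: {name}" for name in MANDATORY_FILES
                if not (root / name).is_file()]
    model_file = root / "model.py"
    if model_file.is_file():
        source = model_file.read_text(encoding="utf-8")
        # Plain text search, so it can be fooled by comments or strings.
        problems += [f"model.py does not define {cls}" for cls in REQUIRED_CLASSES
                     if f"class {cls}" not in source]
    return problems


print(check_submission_dir("../baselines/random"))
```

An empty list only means these basic checks passed. Running <code>run.py</code> on the folder remains the real test.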

--- 

## Next steps
Now you know all the steps required to create a valid code submission.

Good luck!