# Models Recipes

On this page, we will show you how to define your own models. In `carefree-learn`, this takes one of three APIs: `register_ml_module` (for [ML models](#ML-Models)), `register_module` (for [Other Models](#Other-Models)) and `register_custom_module` (for [Complex Models](#Complex-Models)).

> You might notice that if you run the blocks with `register_*` calls more than once, `carefree-learn` will throw a warning which says " '...' has already been registered ", and your changes will have no effect. This is intentional, because normally we **DO NOT** want to register anything more than once.
>
> However, if you are using interactive development tools (e.g. Jupyter Notebook), it is very common to modify an implementation more than once. In this case, we can set `allow_duplicate=True` in the `register_*` functions to bypass this check. Of course, for safety, this should **NEVER** happen in production!

# Table of Contents

- [One-Stage Models](#One-Stage-Models)
  - [ML Models](#ML-Models)
    - [Custom Configurations](#Custom-Configurations)
  - [Other Models](#Other-Models)
- [Complex Models](#Complex-Models)
  - [Simple Complex Models](#Simple-Complex-Models)
    - [`__init__` / `forward`](#__init__-/-forward)
    - [`g_parameters` / `d_parameters`](#g_parameters-/-d_parameters)
    - [`train_step`](#train_step)
    - [`evaluate_step`](#evaluate_step)
    - [Summary](#Summary)
    - [Run it!](#Run-it!)
  - [Full set of Functionalities](#Full-set-of-Functionalities)
- [Appendix](#Appendix)
  - [ML Encodings](#ML-Encodings)
    - [Optimizations](#Optimizations)
  - [`ImageCallback`](#ImageCallback)

> You might also notice that:
> - The class names defined below match the registered names. This is not required, since `carefree-learn` only cares about the name that you pass to the `register_*` function, and will not check the actual class name.

# Preparations

In [1]:
import torch
import cflearn

import numpy as np
import torch.nn as nn

from torch import Tensor
from typing import Dict

np.random.seed(142857)
torch.manual_seed(142857)

<torch._C.Generator at 0x2023902e330>

# One-Stage Models

We will first look at the typical situation where we need to define one-stage models. 'One-stage' here means that the training step contains only one optimizer step, so we can focus on how to define the forward pass of our models, and leave `carefree-learn` to handle the rest.

> The opposite of 'one-stage' models are the [Complex Models](#Complex-Models). A typical 'complex' model is a `GAN`, which in general performs a generator optimizing step **AND** a discriminator optimizing step in one **SINGLE** training step.

## ML Models

In `carefree-learn`, Machine Learning Models are slightly different from other models because:
- We have integrated some common data-preprocessing methods into the ML pipeline (e.g. `one_hot` encoding, `embedding`).
- There are some shared arguments that should be used by all ML models: input dimension, output dimension and number of history steps (the last one is used in time series tasks).

Therefore, `carefree-learn` has:
- Wrapped the registered `nn.Module` internally to make it suitable for the ML pipeline.
- Introduced three (optional) pre-defined arguments for all ML Models: `input_dim`, `output_dim` and `num_history`.
  - We call it the 'dimension system' of `carefree-learn`.

We will dive into these details in the following sections step by step.

In [2]:
@cflearn.register_ml_module("my_linear0", allow_duplicate=False)
class MyLinear0(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.net = nn.Linear(input_dim, output_dim)
    
    def forward(self, net: Tensor) -> Tensor:
        return self.net(net)

And that's it! We can now integrate it into our ML pipeline with the `fit_ml` API:

In [3]:
n       = 100
in_dim  = 5
out_dim = 2

x = np.random.random([n, in_dim])
y = np.random.randint(0, 2, [n, 1])

m = cflearn.api.fit_ml(
    x,
    y,
    core_name="my_linear0",
    output_dim=out_dim,
    is_classification=True,
    # debug setting, indicating that we only train for one step
    fixed_steps=1,
)

Layer (type)                             Input Shape                             Output Shape    Trainable Param #
------------------------------------------------------------------------------------------------------------------------
MLModel                                                                                                           
  _                                                                                                               
    MyLinear0                                [-1, 5]                                  [-1, 2]                   12
      Linear                                 [-1, 5]                                  [-1, 2]                   12
Total params: 12
Trainable params: 12
Non-trainable params: 0
------------------------------------------------------------------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
------

- The `output_dim` passed to `fit_ml` will be passed into your model as well.
- The `input_dim` is not provided, and `carefree-learn` will use `x.shape[1]` as `input_dim`.

If the `input_dim` is specified, we will use it regardless of `x.shape[1]`:

In [4]:
@cflearn.register_ml_module("my_linear1", allow_duplicate=False)
class MyLinear1(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        self.net = nn.Linear(input_dim, output_dim)
    
    def forward(self, net: Tensor) -> Tensor:
        # duplicate the input
        return self.net(torch.cat([net, net], dim=1))

m = cflearn.api.fit_ml(
    x,
    y,
    core_name="my_linear1",
    # the input is duplicated, so we need to specify the `input_dim`
    input_dim=5 * 2,
    output_dim=out_dim,
    is_classification=True,
    # debug setting, indicating that we only train for one step
    fixed_steps=1,
)

Layer (type)                             Input Shape                             Output Shape    Trainable Param #
------------------------------------------------------------------------------------------------------------------------
MLModel                                                                                                           
  _                                                                                                               
    MyLinear1                                [-1, 5]                                  [-1, 2]                   22
      Linear                                [-1, 10]                                  [-1, 2]                   22
Total params: 22
Trainable params: 22
Non-trainable params: 0
------------------------------------------------------------------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
------

You might notice that in the 'summary' panel shown above, the `MyLinear1` module is 'wrapped' by `MLModel` and `_`. This is what `carefree-learn` does internally to make your model compatible with the ML pipeline.

Here's an example, with `one_hot` encoding and `embedding` considered, to show you why this kind of 'wrapping' is useful and powerful:

In [5]:
from cflearn import MERGED_KEY
from cflearn import ONE_HOT_KEY
from cflearn import EMBEDDING_KEY
from cflearn import NUMERICAL_KEY

@cflearn.register_ml_module("my_linear2", allow_duplicate=False)
class MyLinear2(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        print("> input_dim", input_dim)
        self.net = nn.Linear(input_dim, output_dim)
    
    # notice that we use `batch` here, and the naming is important!
    def forward(self, batch: Dict[str, Tensor]) -> Tensor:
        merged = batch[MERGED_KEY]
        one_hot = batch[ONE_HOT_KEY]
        embedding = batch[EMBEDDING_KEY]
        numerical = batch[NUMERICAL_KEY]
        print()
        print(">>> merged", merged.shape)
        if one_hot is not None:
            print(">>> one_hot", one_hot.shape)
        if embedding is not None:
            print(">>> embedding", embedding.shape)
        if numerical is not None:
            print(">>> numerical", numerical.shape)
        print()
        return self.net(merged)

> As the comment says, the naming of the `forward` argument, `batch`, is important, because it tells `carefree-learn` that you require the full batch instead of a single `Tensor`.

The newly defined `my_linear2` model can be used as usual:
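To make the idea of 'naming matters' concrete, here is a minimal sketch of how a wrapper could decide what to pass to `forward` by looking at the argument name. It is an illustration only and does not claim to mirror `carefree-learn`'s internals: `wants_full_batch`, `call_forward` and the `"merged"` / `"numerical"` keys are hypothetical, and plain lists stand in for tensors.

```python
import inspect


def wants_full_batch(forward_fn):
    """Return True if the first non-self parameter of `forward_fn` is named `batch`."""
    names = [n for n in inspect.signature(forward_fn).parameters if n != "self"]
    return bool(names) and names[0] == "batch"


class TensorStyle:
    # Receives only the merged input.
    def forward(self, net):
        return net


class BatchStyle:
    # Receives the whole dictionary and picks what it needs.
    def forward(self, batch):
        return batch["merged"]


def call_forward(module, batch):
    # Hypothetical dispatcher: hand over the full dict only when it is asked for.
    if wants_full_batch(module.forward):
        return module.forward(batch)
    return module.forward(batch["merged"])


batch = {"merged": [1.0, 2.0], "numerical": [1.0, 2.0]}
print(call_forward(TensorStyle(), batch))
print(call_forward(BatchStyle(), batch))
```

Both calls see the same merged data; only the second module also has access to the other entries of the dictionary.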

In [6]:
m = cflearn.api.fit_ml(
    x,
    y,
    core_name="my_linear2",
    output_dim=out_dim,
    is_classification=True,
    # debug setting, indicating that we only train for one step
    fixed_steps=1,
)

> input_dim 5

>>> merged torch.Size([1, 5])
>>> numerical torch.Size([1, 5])

Layer (type)                             Input Shape                             Output Shape    Trainable Param #
------------------------------------------------------------------------------------------------------------------------
MLModel                                                                                                           
  _                                                                                                               
    MyLinear2                                                                                                     
      Linear                                 [-1, 5]                                  [-1, 2]                   12
Total params: 12
Trainable params: 12
Non-trainable params: 0
------------------------------------------------------------------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pas

But the powerful part is that it can now utilize the encoding methods (`one_hot` / `embedding`) provided by `carefree-learn`:

> We will use some encoding settings in a few following blocks. Please refer to the [Appendix](#ML-Encodings) section for more details.

In [7]:
n                 = 100
in_dim            = 5
one_hot_dim       = 13
embedding_dim     = 7
out_dim           = 2
# some encoding settings. Please refer to the `ML Encodings` section in the `Appendix` section for more details.
one_hot_setting   = dict(dim=one_hot_dim, methods="one_hot")
embedding_setting = dict(dim=embedding_dim, methods="embedding")
encoding_settings = {
    # one hot columns   : [6]
    6: one_hot_setting,
    # embedding columns : [5, 7] 
    5: embedding_setting,
    7: embedding_setting,
}

x = np.hstack([
    np.random.random([n, in_dim]),
    np.random.randint(0, embedding_dim, [n, 1]),
    np.random.randint(0, one_hot_dim, [n, 1]),
    np.random.randint(0, embedding_dim, [n, 1]),
])
y = np.random.randint(0, 2, [n, 1])

m = cflearn.api.fit_ml(
    x,
    y,
    core_name="my_linear2",
    output_dim=out_dim,
    is_classification=True,
    # encoding settings
    encoding_settings=encoding_settings,
    # debug setting, indicating that we only train for one step
    fixed_steps=1,
)

> input_dim 26

>>> merged torch.Size([1, 26])
>>> one_hot torch.Size([1, 13])
>>> embedding torch.Size([1, 8])
>>> numerical torch.Size([1, 5])

Layer (type)                             Input Shape                             Output Shape    Trainable Param #
------------------------------------------------------------------------------------------------------------------------
MLModel                                                                                                           
  Encoder                                    [-1, 8]                      [[-1, 13], [-1, 8]]                   56
    ModuleDict-1                                                                                                  
      OneHot                                    [-1]                                 [-1, 13]                    0
        Lambda                                  [-1]                                 [-1, 13]                    0
    ModuleDict-0                           

- We don't need to specify the `input_dim`. In this case, `carefree-learn` will use `merged_dim` as `input_dim`.
- `merged_dim` = `one_hot_dim` + `embedding_dim` + `numerical_dim` (here `26 = 13 + 8 + 5`), because `carefree-learn` simply concatenates every kind of input to create the `merged` input.
- In the 'summary' panel, we can see that each `Embedding` module outputs a `4`-dimensional `Tensor`, with `56` trainable params in total. That's because we have `2` embedding columns, each with `7` distinct values, so `56 = 2 * 7 * 4`.
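The arithmetic in the list above can be checked in a few lines of plain Python. The numbers are taken from the cell and the printed shapes above; the variable names are only illustrative and are not `carefree-learn` identifiers.

```python
# Column layout used in the example above.
num_numerical_columns = 5
one_hot_num_values = 13       # one one-hot column with 13 distinct values
num_embedding_columns = 2
embedding_num_values = 7      # distinct values per embedding column
embedding_out_dim = 4         # per-column embedding size, as stated above

# Each part of the input after encoding.
one_hot_dim = one_hot_num_values
embedding_dim = num_embedding_columns * embedding_out_dim
numerical_dim = num_numerical_columns

# The merged input is the concatenation of the three parts.
merged_dim = one_hot_dim + embedding_dim + numerical_dim

# An embedding table holds one vector per distinct value.
embedding_params = num_embedding_columns * embedding_num_values * embedding_out_dim

print(one_hot_dim, embedding_dim, numerical_dim, merged_dim)  # 13 8 5 26
print(embedding_params)  # 56
```

Note that the `embedding_dim` variable in the cell above (`7`) is the number of distinct values per column, while the `embedding` part of the batch has `8` features.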

Although it is already very powerful to have access to every part of the inputs, it is still pretty hard to utilize them, because currently we can only get the `merged_dim` in our `__init__` method. `carefree-learn` therefore provides a `dimensions` argument that gives you all you want.

For example, let's implement a [Wide & Deep](https://arxiv.org/pdf/1606.07792.pdf)-like model, which feeds the `one_hot` part to a `Linear` model, and feeds the `numerical` and `embedding` parts to an `MLP` model:

In [8]:
@cflearn.register_ml_module("my_wide_and_deep", allow_duplicate=True)
class MyWideAndDeep(nn.Module):
    # notice that we use `dimensions` here, and the naming is important!
    def __init__(self, dimensions, output_dim):
        super().__init__()
        print(">", dimensions)
        self.wide = nn.Linear(dimensions.one_hot_dim, output_dim)
        self.deep = nn.Sequential(
            nn.Linear(dimensions.embedding_dim + dimensions.numerical_dim, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim),
        )

    def forward(self, batch: Dict[str, Tensor]) -> Tensor:
        one_hot = batch[ONE_HOT_KEY]
        embedding = batch[EMBEDDING_KEY]
        numerical = batch[NUMERICAL_KEY]
        wide_output = self.wide(one_hot)
        deep_output = self.deep(torch.cat([embedding, numerical], dim=1))
        return wide_output + deep_output

> As the comment says, the naming of the `__init__` argument, `dimensions`, is important, because it tells `carefree-learn` that you require the full `dimensions` information instead of a single `input_dim`.

We can run it and see if it works as expected:

In [9]:
m = cflearn.api.fit_ml(
    x,
    y,
    core_name="my_wide_and_deep",
    output_dim=out_dim,
    is_classification=True,
    # encoding settings
    encoding_settings=encoding_settings,
    # debug setting, indicating that we only train for one step
    fixed_steps=1,
)

> Dimensions(
    merged_dim    = 26
    one_hot_dim   = 13
    embedding_dim = 8
    numerical_dim = 5
)
Layer (type)                             Input Shape                             Output Shape    Trainable Param #
------------------------------------------------------------------------------------------------------------------------
MLModel                                                                                                           
  Encoder                                    [-1, 8]                      [[-1, 13], [-1, 8]]                   56
    ModuleDict-1                                                                                                  
      OneHot                                    [-1]                                 [-1, 13]                    0
        Lambda                                  [-1]                                 [-1, 13]                    0
    ModuleDict-0                                                                   

Bravo! Everything works like a charm! 🥳

### Custom Configurations

So far we've introduced the 'dimension system' of the ML Models in `carefree-learn`, but you might want to know how to use custom hyper-parameters in your own models. For example, to specify `use_bias` in `my_linear`:

In [10]:
@cflearn.register_ml_module("my_linear3", allow_duplicate=False)
class MyLinear3(nn.Module):
    def __init__(self, input_dim, output_dim, *, use_bias: bool):
        super().__init__()
        self.net = nn.Linear(input_dim, output_dim, bias=use_bias)
    
    def forward(self, net):
        return self.net(net)

If you use `my_linear3` directly without specifying configurations, `carefree-learn` will throw an error:

In [11]:
try:
    m = cflearn.api.fit_ml(
        x,
        y,
        core_name="my_linear3",
        output_dim=out_dim,
        is_classification=True,
        # debug setting, indicating that we only train for one step
        fixed_steps=1,
    )
except TypeError as err:
    print(err)

__init__() missing 1 required keyword-only argument: 'use_bias'


As the error indicates, we are missing `use_bias` to initialize the `MyLinear3` module. To fix it, we can add a `core_config` part to the `fit_ml` API:

In [12]:
m = cflearn.api.fit_ml(
    x,
    y,
    core_name="my_linear3",
    # Add This!
    core_config=dict(use_bias=False),
    output_dim=out_dim,
    is_classification=True,
    # debug setting, indicating that we only train for one step
    fixed_steps=1,
)

Layer (type)                             Input Shape                             Output Shape    Trainable Param #
------------------------------------------------------------------------------------------------------------------------
MLModel                                                                                                           
  _                                                                                                               
    MyLinear3                                [-1, 8]                                  [-1, 2]                   16
      Linear                                 [-1, 8]                                  [-1, 2]                   16
Total params: 16
Trainable params: 16
Non-trainable params: 0
------------------------------------------------------------------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
------

As shown above, the number of trainable parameters of `Linear` is only `16` (`8 * 2`, with no bias term), which means the `bias` has indeed been disabled.

> We put `use_bias` after a `*` to make it a keyword-only argument. You are not required to do so, but in general it's recommended because:
> - It makes your module easier to understand when it is used by others.
> - It separates `carefree-learn`'s 'dimension system' from your own custom configurations.
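The parameter counts reported in the summaries above follow directly from the shape of a linear layer: one weight per input/output pair, plus one bias per output when the bias is enabled. A small helper (hypothetical, written only for this check) reproduces them:

```python
def linear_param_count(in_features, out_features, bias=True):
    """Number of trainable parameters of a fully connected layer."""
    weights = in_features * out_features
    biases = out_features if bias else 0
    return weights + biases


print(linear_param_count(5, 2))               # 12, as in `my_linear0`
print(linear_param_count(10, 2))              # 22, as in `my_linear1`
print(linear_param_count(8, 2, bias=False))   # 16, as in `my_linear3`
```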
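The effect of the `*` can be seen without any framework. In the sketch below, `Core` and `build` are hypothetical stand-ins: `build` imitates a pipeline that supplies the dimension arguments positionally and unpacks a user-provided config dictionary as keyword arguments. It is an assumption about the general pattern, not a description of how `fit_ml` is implemented.

```python
class Core:
    # Arguments before `*` stand for the framework-supplied dimensions;
    # everything after `*` can only be passed by keyword.
    def __init__(self, input_dim, output_dim, *, use_bias):
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.use_bias = use_bias


def build(core_cls, input_dim, output_dim, core_config=None):
    # Hypothetical builder: custom settings are unpacked as keyword arguments.
    return core_cls(input_dim, output_dim, **(core_config or {}))


# Missing configuration: Python itself reports the missing keyword-only argument.
try:
    build(Core, 8, 2)
except TypeError as err:
    missing_message = str(err)
    print(missing_message)

# Supplying the configuration works as expected.
core = build(Core, 8, 2, core_config=dict(use_bias=False))
print(core.use_bias)  # False

# Passing `use_bias` positionally is rejected, so it can never be confused
# with the dimension arguments.
try:
    Core(8, 2, False)
except TypeError:
    positional_rejected = True
```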

## Other Models

Besides ML Models, customizing other models (e.g. CV models) with the `register_module` API is almost the same as writing a custom `nn.Module`. For example, let's build a simple image classification model from scratch:

In [13]:
@cflearn.register_module("my_image_classifier", allow_duplicate=True)
class MyImageClassifier(nn.Module):
    def __init__(self, in_channels, img_size, output_dim):
        super().__init__()
        flat_dim = in_channels * img_size ** 2
        self.net = nn.Sequential(
            nn.Linear(flat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim),
        )
    
    def forward(self, net):
        # flatten the input first
        net = net.view(net.shape[0], -1)
        return self.net(net)

Unlike ML Models, for any other models, `carefree-learn` will not 'inject' any pre-defined arguments to your `nn.Module`, so it will be safe & clean! 😉

> In fact, even for ML Models, you can ignore the 'dimension system' completely! Just avoid using names like `in_dim` / `input_dim` / `out_dim` / `output_dim` and everything will be fine.

We can play around with the `my_image_classifier` model on the famous `MNIST` dataset with the `fit_cv` API:

In [14]:
# get MNIST data with `cflearn`'s predefined API
data = cflearn.cv.MNISTData(batch_size=2, transform="to_tensor")
# use `fit_cv` API for training
cflearn.api.fit_cv(
    # This first argument passed to `fit_cv` is complicated and hard to explain briefly
    # So we will cover its details in another article (the `Data Recipes`)
    data,
    # this is the name of your model
    model_name="my_image_classifier",
    # these (and only these) settings will go into your model's __init__ method
    model_config={"in_channels": 1, "img_size": 28, "output_dim": 10},
    # these are some training settings
    loss_name="cross_entropy",
    metric_names="acc",
    # debug setting, indicating that we only use a small portion of data to do validation
    valid_portion=1.0e-5,
    # debug setting, indicating that we only train for one step
    fixed_steps=1,
)

Layer (type)                             Input Shape                             Output Shape    Trainable Param #
------------------------------------------------------------------------------------------------------------------------
_                                                                                                                 
  MyImageClassifier                  [-1, 1, 28, 28]                                 [-1, 10]              101,770
    Sequential                             [-1, 784]                                 [-1, 10]              101,770
      Linear-0                             [-1, 784]                                [-1, 128]              100,480
      ReLU                                 [-1, 128]                                [-1, 128]                    0
      Linear-1                             [-1, 128]                                 [-1, 10]                1,290
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
--

<cflearn.api.cv.pipeline.CarefreePipeline at 0x2027ecd3130>

You might notice that the naming is slightly different from ML Models:
- `core_name` -> `model_name`
- `core_config` -> `model_config`

This is because ML Models will 'wrap' customized models under the `MLModel`, which means they serve as the `core` of `MLModel`. That's why we use `core_name` & `core_config`. But for other situations, the customized models will be left as-is, so we use `model_name` & `model_config`.

# Complex Models

For complicated tasks that a 'one-stage' model cannot handle, we need to define 'complex' models, in which you have full control over the training step and the evaluation step.

With the help of `inspect`, `carefree-learn` will let you **CHOOSE** the functionalities you need. So you can not only define a 'complex' model easily, but also leverage a full set of powerful mechanisms such as [`amp`](https://pytorch.org/docs/stable/amp.html), gradient norm clipping, learning rate scheduling and so on.

In the following sections, we will use the famous `GAN` model to illustrate the concepts. We will start from a simple implementation, and then gradually increase the complexity.

## Preparations

In [15]:
from cflearn.protocol import StepOutputs
from cflearn.protocol import MetricsOutputs
from cflearn.constants import INPUT_KEY
from cflearn.constants import PREDICTIONS_KEY
from cflearn.misc.toolkit import to_device
from cflearn.misc.toolkit import interpolate
from cflearn.misc.toolkit import toggle_optimizer
from cflearn.modules.blocks import Lambda
from cflearn.modules.blocks import UpsampleConv2d

class GANLoss(nn.Module):
    def __init__(self):  # type: ignore
        super().__init__()
        self.loss = nn.BCEWithLogitsLoss()
        self.register_buffer("real_label", torch.tensor(1.0))
        self.register_buffer("fake_label", torch.tensor(0.0))

    def forward(self, predictions: Tensor, target_is_real: bool) -> Tensor:
        target = self.real_label if target_is_real else self.fake_label
        target = target.expand_as(predictions)
        loss = self.loss(predictions, target)
        return loss

def make_generator(in_channels, img_size, latent_dim):
    latent_channels = 64
    latent_wh = img_size // 4
    mapped_dim = latent_channels * latent_wh ** 2
    return nn.Sequential(
        nn.Linear(latent_dim, mapped_dim),
        nn.LeakyReLU(0.2, inplace=True),
        Lambda(lambda t: t.view(-1, latent_channels, latent_wh, latent_wh), name="reshape"),
        UpsampleConv2d(latent_channels, latent_channels, kernel_size=3, padding=1, factor=2),
        nn.LeakyReLU(0.2, inplace=True),
        UpsampleConv2d(latent_channels, latent_channels, kernel_size=3, padding=1, factor=2),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(latent_channels, in_channels, kernel_size=7, padding="same"),
        nn.Sigmoid(),
    )

def make_discriminator(in_channels, img_size):
    latent_dim = 64 * (img_size // 4) ** 2
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, 3, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Dropout(0.4),
        nn.Conv2d(64, 64, 3, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        nn.Dropout(0.4),
        Lambda(lambda t: t.view(t.shape[0], latent_dim), name="reshape"),
        nn.Linear(latent_dim, 1),
    )
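The `GANLoss` above wraps `nn.BCEWithLogitsLoss` and only chooses the target (`1.0` for real, `0.0` for fake). As a rough illustration of what that loss computes, here is a pure-Python sketch for a single logit, using the standard numerically stable form of binary cross-entropy on logits. The function names are hypothetical, and the sketch ignores batching and reduction.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy for one raw logit `x` and one target `y` in {0, 1}.

    Uses the stable form: max(x, 0) - x * y + log(1 + exp(-|x|)).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def gan_loss(logit, target_is_real):
    # Same target selection as `GANLoss.forward`, for a single value.
    target = 1.0 if target_is_real else 0.0
    return bce_with_logits(logit, target)


# An undecided discriminator (logit 0) is penalised equally for both targets.
print(gan_loss(0.0, True), gan_loss(0.0, False))

# A confident 'real' prediction (logit 3) is cheap when the target is real
# and expensive when the target is fake.
print(gan_loss(3.0, True), gan_loss(3.0, False))
```

This is why the generator step in a `GAN` can reuse the same loss with `target_is_real=True` on generated samples: the loss is small only when the discriminator is fooled.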

## Simple Complex Models

We'll first show you how to implement a simple `GAN` model, which can illustrate a minimum subset of the provided functionalities:

In [16]:
@cflearn.register_custom_module("my_gan0", allow_duplicate=False)
class MyGAN0(cflearn.CustomModule):
    def __init__(self, in_channels, img_size, latent_dim):
        super().__init__()
        self.latent_dim = latent_dim
        self.generator = make_generator(in_channels, img_size, latent_dim)
        self.discriminator = make_discriminator(in_channels, img_size)
        self.loss = GANLoss()

    def sample(self, num_samples):
        z = torch.randn(num_samples, self.latent_dim, device=self.device)
        return self.generator(z)

    def forward(self, net):
        # return the generated samples so that `forward` has an output
        return self.sample(len(net))

    @property
    def g_parameters(self):
        return list(self.generator.parameters())

    @property
    def d_parameters(self):
        return list(self.discriminator.parameters())

    def train_step(self, batch, optimizers) -> StepOutputs:
        net = batch[INPUT_KEY]
        opt_g = optimizers["core.g_parameters"]
        opt_d = optimizers["core.d_parameters"]
        # generator step
        # `toggle_optimizer` can help you focus on `opt_g`'s gradients, and ignore other gradients
        with toggle_optimizer(self, opt_g):
            opt_g.zero_grad()
            sampled = self.sample(len(net))
            pred_fake = self.discriminator(sampled)
            g_loss = self.loss(pred_fake, target_is_real=True)
            g_loss.backward()
            opt_g.step()
        # discriminator step
        # `toggle_optimizer` can help you focus on `opt_d`'s gradients, and ignore other gradients
        with toggle_optimizer(self, opt_d):
            opt_d.zero_grad()
            pred_real = self.discriminator(net)
            loss_d_real = self.loss(pred_real, target_is_real=True)
            pred_fake = self.discriminator(sampled.detach())
            loss_d_fake = self.loss(pred_fake, target_is_real=False)
            d_loss = 0.5 * (loss_d_fake + loss_d_real)
            d_loss.backward()
            opt_d.step()
        # finalize
        return StepOutputs({}, {})

    def evaluate_step(self, loader) -> MetricsOutputs:
        loss_items = {}
        for i, batch in enumerate(loader):
            # `to_device` can help you put the tensors in a `dict` to a specific device
            batch = to_device(batch, self.device)
            net = batch[INPUT_KEY]
            sampled = self.sample(len(net))
            pred_fake = self.discriminator(sampled)
            g_loss = self.loss(pred_fake, target_is_real=True)
            pred_real = self.discriminator(net)
            d_loss = self.loss(pred_real, target_is_real=True)
            loss_items.setdefault("g", []).append(g_loss.item())
            loss_items.setdefault("d", []).append(d_loss.item())
        # gather
        mean_loss_items = {k: sum(v) / len(v) for k, v in loss_items.items()}
        loss = sum(mean_loss_items.values())
        mean_loss_items[cflearn.LOSS_KEY] = loss
        return MetricsOutputs(-loss, mean_loss_items)

This `GAN` model covers most of the important concepts that you need to know to build your own 'complex' models. Let's dive into them step by step.

### `__init__` / `forward`

They are the easiest parts, because their behaviours are the same as the behaviours mentioned in the [Other Models](#Other-Models) section.

### `g_parameters` / `d_parameters`

When using 'complex' models, we often need more than one optimizer (otherwise a 'one-stage' model would already satisfy our needs). In this case, we need to specify which parameters each optimizer should focus on. In `carefree-learn`, we can do so by simply defining each group of parameters under a `property` and setting up the `optimizer_settings` config in the `fit_*` APIs (e.g. `fit_ml` & `fit_cv`).

For the example above, we defined a `g_parameters` property and a `d_parameters` property, so the `optimizer_settings` should be something like:

```python
cflearn.api.fit_cv(
    ...,
    # the keys contain a dot, so they must be written as string keys
    optimizer_settings={
        "core.g_parameters": dict(
            optimizer="adam",
            scheduler="warmup",
        ),
        "core.d_parameters": dict(
            optimizer="adam",
            scheduler="warmup",
        ),
    },
    # ... other arguments
)
```

You might notice that the keys of `optimizer_settings` are prefixed with `core.`. This is because `carefree-learn` again 'wraps' your models in order to integrate them into its pipelines.

### `train_step`

Now comes the hard part: the training step. Many complex functionalities lie behind this method, but for this simple example we only use two of them: `batch` & `optimizers`. These two arguments are likely to be the minimum requirements of your `train_step`, because in general you should at least:
1. Do some forward calculations with your model and the input data (the `batch`).
2. Do some backward updates with the `optimizers`.

Let's break down the `train_step` implementation above, line by line:

```python
class StepOutputs(NamedTuple):
    forward_results: Dict[str, Tensor]
    loss_dict: Dict[str, float]

class MyGAN0:
    ...
    def train_step(self, batch, optimizers) -> StepOutputs:
        net = batch[INPUT_KEY]
        opt_g = optimizers["core.g_parameters"]
        opt_d = optimizers["core.d_parameters"]
        ...
```

The first line:

```python
def train_step(self, batch, optimizers) -> StepOutputs:
```

indicates that:
1. We 'require' `batch` & `optimizers` for our `train_step`.
2. We need to return a `StepOutputs` at the end of this `train_step`.

> In later sections, you will find that we 'require' more and more things for our `train_step`. You are free to 'require' whatever you need from the [Full set of Functionalities](#Full-set-of-Functionalities)!
>
> Unlike some other methods, the order of the arguments in `train_step` is not important, but the naming is. So you can actually write:

```python
def train_step(self, optimizers, batch) -> StepOutputs:
    ...
```
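One way a framework can make argument order irrelevant is to inspect the method's signature and pass each requested argument by name. The sketch below shows this general technique with `inspect.signature`; it is not taken from `carefree-learn`'s source, and `call_with_requested` as well as the contents of `available` are hypothetical.

```python
import inspect


def call_with_requested(fn, available):
    """Call `fn` with exactly the keyword arguments named in its signature."""
    requested = list(inspect.signature(fn).parameters)
    missing = [name for name in requested if name not in available]
    if missing:
        raise TypeError(f"unsupported arguments requested: {missing}")
    return fn(**{name: available[name] for name in requested})


# Everything the (hypothetical) trainer is able to provide.
available = {
    "batch": {"input": [1, 2, 3]},
    "optimizers": {"core.g_parameters": "opt_g", "core.d_parameters": "opt_d"},
    "extra": "something a step may or may not ask for",
}


def step_a(batch, optimizers):
    return len(batch["input"]), sorted(optimizers)


def step_b(optimizers, batch):
    # Same arguments, different order: the result is identical.
    return len(batch["input"]), sorted(optimizers)


def step_c(batch):
    # Requesting fewer arguments is fine as well.
    return len(batch["input"])


result_a = call_with_requested(step_a, available)
result_b = call_with_requested(step_b, available)
result_c = call_with_requested(step_c, available)
print(result_a == result_b, result_c)
```

With this approach a misspelled argument name fails loudly, which is the trade-off for not having to care about ordering.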

The second line:

```python
net = batch[INPUT_KEY]
```

extracts the input tensor from the `batch`.

The third & fourth lines:

```python
opt_g = optimizers["core.g_parameters"]
opt_d = optimizers["core.d_parameters"]
```

extract the corresponding optimizer of each group of parameters. You might notice that the keys of `optimizers` match the keys of `optimizer_settings`, as you would expect.

The rest of `train_step` is basically a typical `GAN` implementation, but the `finalize` part deserves some attention:
```python
# finalize
forward_results = {PREDICTIONS_KEY: sampled}
loss_dict = {
    "g": g_loss.item(),
    "d": d_loss.item(),
    "d_fake": loss_d_fake.item(),
    "d_real": loss_d_real.item(),
}
return StepOutputs(forward_results, loss_dict)
```

Two `dict`s are constructed here: `forward_results` and `loss_dict`. `carefree-learn` does not use `forward_results` yet, and it rarely uses `loss_dict` either, except when:
1. No validation dataset is provided, in which case `carefree-learn` will use this `loss_dict` to calculate metrics.
2. The `mlflow` callback is enabled, in which case `carefree-learn` will record these losses.

> So for simplicity, it's completely OK to return two empty `dict`s here:

```python
# finalize
return StepOutputs({}, {})
```

### `evaluate_step`

In fact, the `evaluate_step` is not very important in `GAN` models, because it is well known that it's hard to find a suitable metric for `GAN`s. However, we can still cover the basic concepts with the above example.

> You might be wondering why we did not use the `.eval()` method to turn modules into evaluation mode. That's because `carefree-learn` will handle this for you outside `evaluate_step`!

The first line:

```python
def evaluate_step(self, loader) -> MetricsOutputs:
```

indicates that:
1. We 'require' `loader` for our `evaluate_step`.
2. We need to return a `MetricsOutputs` at the end of this `evaluate_step`.

> The `loader` is likely to be the minimum requirement of your `evaluate_step`, because in general you will need access to the dataset.
>
> In later sections, you will find that we 'require' more and more things for our `evaluate_step`. You are free to 'require' whatever you need from the [Full set of Functionalities](#Full-set-of-Functionalities)!

The rest of the `evaluate_step` implementation just calculates the losses of each batch and records them in `loss_items`, but the `gather` part deserves some attention:

```python
# gather

# this calculates the mean of each recorded loss
mean_loss_items = {k: sum(v) / len(v) for k, v in loss_items.items()}
# this sums up the mean of each recorded loss
loss = sum(mean_loss_items.values())
mean_loss_items[cflearn.LOSS_KEY] = loss
# we want the `loss` to be lower, so the `final_score` should be the opposite of the `loss`
# for verbose / logging purpose, we can return all recorded losses as `metric_values`
# in this case, they will be printed in the console & logged to our log file
return MetricsOutputs(final_score=-loss, metric_values=mean_loss_items)
```
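To see what the `gather` part produces, here is a self-contained run on plain floats. It is a sketch under two assumptions: `LOSS_KEY = "loss"` stands in for `cflearn.LOSS_KEY`, and the `MetricsOutputs` defined here is a hypothetical stand-in that only mimics the two fields used above.

```python
from typing import Dict, List, NamedTuple

LOSS_KEY = "loss"  # assumption: stand-in for `cflearn.LOSS_KEY`


class MetricsOutputs(NamedTuple):  # hypothetical stand-in with the same two fields
    final_score: float
    metric_values: Dict[str, float]


def gather(loss_items: Dict[str, List[float]]) -> MetricsOutputs:
    # mean of each recorded loss over all batches
    mean_loss_items = {k: sum(v) / len(v) for k, v in loss_items.items()}
    # total loss is the sum of the per-key means
    loss = sum(mean_loss_items.values())
    mean_loss_items[LOSS_KEY] = loss
    # lower loss is better, so the score is the negated loss
    return MetricsOutputs(final_score=-loss, metric_values=mean_loss_items)


# losses recorded over two batches
outputs = gather({"g": [1.0, 3.0], "d": [0.5, 1.5]})
print(outputs.final_score)    # -3.0
print(outputs.metric_values)  # {'g': 2.0, 'd': 1.0, 'loss': 3.0}
```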

### Summary

And that's it! We've covered most of the basic concepts we need to know to build a 'complex' model. To sum up, we need to:
1. Implement `__init__` / `forward` methods as usual.
2. Define some properties that return a certain group of parameters.
   1. These property names should be prefixed with `core.` when passed to the `optimizer_settings` as keys.
   2. We can get the corresponding optimizers in `train_step` with these keys.
3. Handle `zero_grad`, `backward`, `step` and everything else a training step needs on our own in the `train_step` method.
4. Calculate the `final_score` and return some other metrics for verbose / logging purposes.

### Run it!

The model defined above can actually work pretty well!

In [17]:
# change `num_workers` to 2 to run faster!
data = cflearn.cv.MNISTData(batch_size=16, num_workers=0, transform="to_tensor")
cflearn.api.fit_cv(
    data,
    "my_gan0",
    {"in_channels": 1, "img_size": 28, "latent_dim": 64},
    fixed_steps=1,  # comment out this line if you want to run a full experiment!
    optimizer_settings={
        "core.g_parameters": {
            "optimizer": "adam",
            "scheduler": "linear",
            "optimizer_config": {
                "lr": 0.0002,
                "betas": [0.5, 0.999],
            },
            "scheduler_config": {
                "start_epoch": 100,
                "end_epoch": 200,
            },
        },
        "core.d_parameters": {
            "optimizer": "adam",
            "scheduler": "linear",
            "optimizer_config": {
                "lr": 0.0002,
                "betas": [0.5, 0.999],
            },
            "scheduler_config": {
                "start_epoch": 100,
                "end_epoch": 200,
            },
        },
    },
    cuda=None,  # change this to your gpu_id if you have gpus!
)

Layer (type)                             Input Shape                             Output Shape    Trainable Param #
------------------------------------------------------------------------------------------------------------------------
_                                                                                                                 
  MyGAN0                                                                                                          
    Sequential-0                            [-1, 64]                          [-1, 1, 28, 28]              280,833
      Linear                                [-1, 64]                               [-1, 3136]              203,840
      LeakyReLU-0                         [-1, 3136]                               [-1, 3136]                    0
      Lambda                              [-1, 3136]                           [-1, 64, 7, 7]                    0
      UpsampleConv2d-0                [-1, 64, 7, 7]                      

<cflearn.api.cv.pipeline.CarefreePipeline at 0x2020e1c2850>

Here are some images generated when I ran a full experiment:

<img src="./static/my_gan0.png" alt="my_gan0" width="200"/>

They are not perfect, but they already demonstrate that the implemented model is correctly integrated into `carefree-learn`'s pipeline.

> You might be wondering how I can generate images during training. That's because I implemented an additional `callback`. Please refer to the [Appendix](#ImageCallback) section for more details.

## Full set of Functionalities



# Appendix

## ML Encodings

What makes Machine Learning tasks different from other tasks is that some of the data-preprocessing methods are 'trainable' (e.g. embedding). In this case, we need to integrate these methods into our models rather than simply putting them in a separate place.

It is OK to implement these methods in our custom models every time we need them, but that will cause a **LOT** of boilerplate code. `carefree-learn` therefore extracts them into an `Encoder` module, and exposes its settings in the APIs so you can utilize it easily.

The definitions related to the `Encoder` are pretty simple:

```python
class EncodingSettings(NamedTuple):
    dim: int
    methods: Union[str, List[str]] = "embedding"
    method_configs: Optional[Dict[str, Any]] = None

class Encoder(nn.Module):
    def __init__(
        self,
        settings: Dict[int, EncodingSettings],
        *,
        # there are a few more kwargs here, which will be covered in the `Optimizations` section
        ...
    ):
        ...
```

The `settings` of the `Encoder` is what we should mainly pay attention to: it's a mapping that maps the column index to its corresponding `EncodingSettings`.

For example, if:
- Our input features, `x`, have 10 columns, so the column indices are [0, 1, 2, ..., 9].
- The `0`th, `5`th & `7`th columns are categorical columns, and they have `5`, `7` & `11` unique values respectively.
- We want to apply `embedding` to the `0`th & `5`th columns.
- We want to apply `one_hot` to the `5`th & `7`th columns.

Then the corresponding setup should be:

In [18]:
from cflearn.models.ml.encoders import Encoder
from cflearn.models.ml.encoders import EncodingSettings

Encoder(
    {
        0: EncodingSettings(5),
        5: EncodingSettings(7, ["one_hot", "embedding"]),
        7: EncodingSettings(11, "one_hot"),
    },
)

Encoder(
  (embeddings): ModuleDict(
    (-1): Embedding(
      (core): Lambda(embedding: 12 -> 4)
    )
  )
  (one_hot_encoders): ModuleDict(
    (5): OneHot(
      (core): Lambda(one_hot_7)
    )
    (7): OneHot(
      (core): Lambda(one_hot_11)
    )
  )
  (embedding_dropout): Dropout(p=0.2, inplace=False)
)

You might notice that the `one_hot_encoders` part is pretty straightforward: we initialized a `ModuleDict` which maps each column index to its corresponding `OneHot` encoder, and the dimension exactly matches the number of unique values. However, the `embeddings` part is a little weird: we only initialized one `Embedding` module, and its dimension is `12`, which is exactly `5+7` - the sum of the numbers of unique values of the two embedded columns.

This is due to a special mechanism in `carefree-learn` - the `fast_embedding` mechanism. We will cover the details in the [next section](#Optimizations). For now, let's just see how to disable this mechanism and make our `Encoder` look 'normal':

In [19]:
Encoder(
    {
        0: EncodingSettings(5),
        5: EncodingSettings(7, ["one_hot", "embedding"]),
        7: EncodingSettings(11, "one_hot"),
    },
    config={
        "use_fast_embedding": False,
    }
)

Encoder(
  (embeddings): ModuleDict(
    (0): Embedding(
      (core): Lambda(embedding: 5 -> 4)
    )
    (5): Embedding(
      (core): Lambda(embedding: 7 -> 4)
    )
  )
  (one_hot_encoders): ModuleDict(
    (5): OneHot(
      (core): Lambda(one_hot_7)
    )
    (7): OneHot(
      (core): Lambda(one_hot_11)
    )
  )
  (embedding_dropout): Dropout(p=0.2, inplace=False)
)

Great! Now the `embeddings` part looks exactly like the `one_hot_encoders` part: we initialized a `ModuleDict` which maps each column index to its corresponding `Embedding` encoder, and the input dimension of each one exactly matches the number of unique values.

### Optimizations

`carefree-learn` not only provides *carefree* APIs for easier usage, but also applies quite a few optimizations to make training on tabular datasets faster than with other similar libraries. In this section we'll introduce some techniques `carefree-learn` adopts under the hood, and show how much of a performance boost we obtain with them.

#### One Hot Encoding

A `one_hot` encoding basically encodes categorical features as a one-hot numeric array, as defined in [`sklearn`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). Suppose we have 5 classes in total, then:

$$
\begin{aligned}
\text{OneHot}(0) &= [1,0,0,0,0] \\
\text{OneHot}(3) &= [0,0,0,1,0]
\end{aligned}
$$

We can see that this kind of encoding is **static**, which means it will not change during the training process. In this case, we can cache all the encodings and access them through indexing. This speeds up the encoding process by ~60x:

> You need to install `scikit-learn` to run the following examples.

In [20]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

num_data = 10000
num_classes = 10
batch_size = 128

x = np.random.randint(0, num_classes, num_data).reshape([-1, 1])
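# `sparse` was renamed to `sparse_output` in scikit-learn 1.2; use `sparse=False` on older versions
enc = OneHotEncoder(sparse_output=False).fit(x)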
x_one_hot = enc.transform(x)

target_indices = np.random.permutation(num_data)[:batch_size]
x_target = x[target_indices]

assert np.allclose(x_one_hot[target_indices], enc.transform(x_target))
%timeit x_one_hot[target_indices]
%timeit enc.transform(x_target)

3.19 µs ± 18.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
222 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


> Although caching can boost performance, it comes at the cost of consuming much more memory. A better solution would be to cache sparse tensors instead of dense ones, but `PyTorch` does not support sparsity well enough yet. See the [Sparsity](#Sparsity) section for more details.
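To get a feeling for this memory cost, here is some back-of-the-envelope arithmetic. It is only a rough sketch: it assumes 4-byte values for the dense cache and 8-byte integers for an index-only representation, and it ignores the bookkeeping overhead that real sparse formats add. Both helper functions are hypothetical.

```python
def dense_cache_bytes(num_data: int, num_classes: int, itemsize: int = 4) -> int:
    """Bytes needed to cache every one-hot row densely."""
    return num_data * num_classes * itemsize


def index_only_bytes(num_data: int, itemsize: int = 8) -> int:
    """Bytes needed to store just the position of the single `1` in every row."""
    return num_data * itemsize


num_data = 10000
for num_classes in (10, 1000):
    dense = dense_cache_bytes(num_data, num_classes)
    sparse = index_only_bytes(num_data)
    # columns: num_classes, dense bytes, index-only bytes, ratio
    print(num_classes, dense, sparse, dense / sparse)
```

Under these assumptions the dense cache grows linearly with `num_classes`, while the index-only representation does not depend on it at all.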

#### Embedding

The `embedding` encoding is borrowed from **N**atural **L**anguage **P**rocessing (**NLP**), where (sparse) input words are converted into dense embeddings via embedding look up. It is quite trivial to turn categorical features into embeddings with the same look up technique, but tabular datasets differ from **NLP** in one respect: they need to maintain many embedding tables, because they have different categorical features with different numbers of values, while **NLP** only needs to maintain one embedding table in most cases.

Since `embedding` is a **dynamic** encoding which contains trainable parameters, we cannot cache them beforehand like we did to `one_hot`. However, we can still optimize it with *fast embedding*. A *fast embedding* basically unifies the embedding dimension of different categorical features, so one unified embedding table is sufficient for the whole `embedding` process.

There's one more thing we need to take care of when applying *fast embedding*: we need to *increment* the values of each categorical feature. Here's a minimal example to illustrate this. Suppose we have two categorical features ($x_1, x_2$) with 2 and 3 classes respectively, then our embedding table will contain 5 rows:

$$
\begin{bmatrix}
    \text{---} \text{---} \ v_1 \ \text{---} \text{---} \\
    \text{---} \text{---} \ v_2 \ \text{---} \text{---} \\
    \text{---} \text{---} \ v_3 \ \text{---} \text{---} \\
    \text{---} \text{---} \ v_4 \ \text{---} \text{---} \\
    \text{---} \text{---} \ v_5 \ \text{---} \text{---}
\end{bmatrix}
$$

In this table, the first two rows belong to $x_1$, while the last three rows belong to $x_2$. However, as we defined above, $x_1\in\{0,1\}$ and $x_2\in\{0,1,2\}$. In order to assign $v_3,v_4,v_5$ to $x_2$, we need to *increment* $x_2$ by $2$ (which is the number of choices $x_1$ could have). After *increment*, we have $x_2\in\{2,3,4\}$ so it can successfully look up $v_3,v_4,v_5$.

Note that the *incremented* indices are **static**, so `carefree-learn` will cache these indices to avoid duplicate calculations when *fast embedding* is applied.
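The *increment* step from the example above can be sketched in a few lines of plain Python. This is an illustration only, not `carefree-learn`'s internal code; `compute_offsets` and `increment` are hypothetical names.

```python
from itertools import accumulate
from typing import List


def compute_offsets(num_classes: List[int]) -> List[int]:
    """Offset of feature i = total number of classes of all features before it."""
    return [0] + list(accumulate(num_classes))[:-1]


def increment(rows: List[List[int]], offsets: List[int]) -> List[List[int]]:
    """Shift every categorical value into its own slice of the unified table."""
    return [[value + offset for value, offset in zip(row, offsets)] for row in rows]


num_classes = [2, 3]  # x1 in {0, 1}, x2 in {0, 1, 2}
offsets = compute_offsets(num_classes)
print(offsets)  # [0, 2]

rows = [[0, 0], [1, 2], [0, 1]]  # each row is one sample (x1, x2)
incremented = increment(rows, offsets)
print(incremented)  # [[0, 2], [1, 4], [0, 3]]
```

After the shift, the second column only takes the (0-based) row indices 2, 3 and 4, which are the rows holding $v_3, v_4, v_5$.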

Since the embedding dimensions are unified, *fast embedding* actually reduces the flexibility a little bit, but it can speed up the encoding process by ~16x:

In [21]:
import math
import torch
import numpy as np
from torch.nn import Embedding
from cflearn.misc.toolkit import to_torch

dim = 20
batch_size = 256

features = []
embeddings = []
for i in range(dim):
    # 5, 10, 15, 20
    num_classes = math.ceil((i + 1) / 5) * 5
    x = np.random.randint(0, num_classes, batch_size).reshape([-1, 1])
    embedding = Embedding(num_classes, 1)
    embeddings.append(embedding)
    features.append(x)

fast_embedding = Embedding(250, 1)
tensor = to_torch(np.hstack(features)).to(torch.long)

def f1():
    return fast_embedding(tensor)

def f2():
    embedded = []
    for i, embedding in enumerate(embeddings):
        embedded.append(embedding(tensor[..., i:i+1]))
    return torch.cat(embedded, dim=1)

assert f1().shape == f2().shape
%timeit f1()
%timeit f2()

30.8 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
598 µs ± 7.49 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


> Theoretically, an `embedding` encoding is nothing more than a `one_hot` encoding followed by a linear projection, so it should be fast enough if we apply sparse matrix multiplications between the `one_hot` encodings and a block diagonal `embedding` look up table. However, as mentioned in the [One Hot Encoding](#One-Hot-Encoding) section, `PyTorch` does not support sparsity well enough yet. See the [Sparsity](#Sparsity) section for more details.

#### Sparsity

It is easy to see that the `one_hot` encoding actually outputs a sparse matrix whose sparsity equals:

$$
1-\frac{1}{\text{num_classes}}
$$

So the sparsity will exceed 90% when `num_classes` is greater than 10, and it is therefore quite natural to think of leveraging sparse data structures to cache these `one_hot` encodings. Even better, the `embedding` encoding can be represented as a sparse matrix multiplication between the `one_hot` encodings and a block diagonal `embedding` look up table, so **THEORETICALLY** (🤣) we could reuse the `one_hot` encodings to get the `embedding` encodings efficiently.
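Both claims can be checked on a toy example. The sketch below is dependency-free and uses dense Python lists, so it only demonstrates the equivalence, not the speed. The embedding values and the helper names (`one_hot`, `matvec`, `sparsity`) are made up for illustration.

```python
from typing import List


def one_hot(index: int, size: int) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(vector: List[float], table: List[List[float]]) -> List[float]:
    """Row vector times matrix."""
    return [sum(v * row[j] for v, row in zip(vector, table)) for j in range(len(table[0]))]


def sparsity(num_classes: int) -> float:
    return 1.0 - 1.0 / num_classes


for n in (5, 10, 100):
    print(n, f"{sparsity(n):.1%}")  # 80.0%, 90.0%, 99.0%

# two features with 2 and 3 classes, each embedded into 1 dimension
e1 = [0.1, 0.2]
e2 = [0.3, 0.4, 0.5]
# block diagonal look up table: x1 only touches column 0, x2 only column 1
block_diag = [[v, 0.0] for v in e1] + [[0.0, v] for v in e2]

x1, x2 = 1, 2
encoded = one_hot(x1, 2) + one_hot(x2, 3)  # concatenated one-hot encodings
projected = matvec(encoded, block_diag)
print(projected)  # [0.2, 0.5]
```

The product picks out exactly `e1[x1]` and `e2[x2]`, which is what two separate embedding look ups would return.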

Unfortunately, although [`scipy`](https://docs.scipy.org/doc/scipy/reference/sparse.html) supports sparse matrices pretty well, `pytorch` does not support them well enough yet. So we'll stick to the dense solutions mentioned above, and will switch to the sparse ones once `pytorch` releases some fancy sparsity support!

## `ImageCallback`

`carefree-learn` implements a powerful `callback` system to meet various requirements, and we will cover its details in another article (the `Callback Recipes`). For now, we will focus on a tiny subset of this system and see what we can achieve by implementing our own `ImageCallback`.

Recall that in the [Simple Complex Models](#Run-it!) section, I mentioned that by implementing an additional `callback`, we can generate images during training, which is extremely useful for **C**omputer **V**ision tasks.

With the help of `ImageCallback`, it is pretty straightforward to implement this:

In [22]:
import os
from cflearn.misc.toolkit import save_images
from cflearn.misc.toolkit import eval_context
from cflearn.misc.internal_.callbacks import ImageCallback

@ImageCallback.register("my_gan0_callback")
class MyGAN0Callback(ImageCallback):
    # `ImageCallback` provides a `num_keep` argument, which lets you specify how many histories you want to keep
    def __init__(self, num_keep: int = 25):
        super().__init__(num_keep)

    def log_artifacts(self, trainer) -> None:
        # this is for ddp only
        if not self.is_rank_0:
            return None
        # get a batch from the validation dataset
        batch = next(iter(trainer.validation_loader))
        # put this batch to the correct device
        batch = to_device(batch, trainer.device)
        # extract the original inputs
        original = batch[INPUT_KEY]
        # extract our `MyGAN0`
        model = trainer.model.core
        # use the internal method `_prepare_folder` to prepare our image folder
        image_folder = self._prepare_folder(trainer)
        # use the util function `save_images` to save original inputs to `png` file
        save_images(original, os.path.join(image_folder, "original.png"))
        # generate images and save them to `png` file
        with eval_context(model):
            sampled = model.sample(len(original))
        save_images(sampled, os.path.join(image_folder, "sampled.png"))

In order to use it, we can simply specify the `callback_names` argument:

In [23]:
cflearn.api.fit_cv(
    data,
    "my_gan0",
    {"in_channels": 1, "img_size": 28, "latent_dim": 64},
    fixed_steps=1,  # comment out this line if you want to run a full experiment!
    optimizer_settings={
        "core.g_parameters": {
            "optimizer": "adam",
            "scheduler": "linear",
            "optimizer_config": {
                "lr": 0.0002,
                "betas": [0.5, 0.999],
            },
            "scheduler_config": {
                "start_epoch": 100,
                "end_epoch": 200,
            },
        },
        "core.d_parameters": {
            "optimizer": "adam",
            "scheduler": "linear",
            "optimizer_config": {
                "lr": 0.0002,
                "betas": [0.5, 0.999],
            },
            "scheduler_config": {
                "start_epoch": 100,
                "end_epoch": 200,
            },
        },
    },
    # Add this!
    callback_names="my_gan0_callback",
    cuda=None,  # change this to your gpu_id if you have gpus!
)

Layer (type)                             Input Shape                             Output Shape    Trainable Param #
------------------------------------------------------------------------------------------------------------------------
_                                                                                                                 
  MyGAN0                                                                                                          
    Sequential-0                            [-1, 64]                          [-1, 1, 28, 28]              280,833
      Linear                                [-1, 64]                               [-1, 3136]              203,840
      LeakyReLU-0                         [-1, 3136]                               [-1, 3136]                    0
      Lambda                              [-1, 3136]                           [-1, 64, 7, 7]                    0
      UpsampleConv2d-0                [-1, 64, 7, 7]                      

<cflearn.api.cv.pipeline.CarefreePipeline at 0x2027ec1f580>

You should then see that the folder structure becomes:

```text
--- _logs
  // timestamp
  |-- 2022-07-25_10-57-24-738951
    // where your images are generated
    |-- images
      // number of step when the images are generated, `-1` indicates that it's the final one
      |-- -1
        // original inputs
        |-- original.png
        // generated images
        |-- sampled.png
```