This notebook file is extracted from https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb as a reference to help you understand the concept of NeMo.

# Foundations of NeMo
NeMo models build on the [PyTorch Lightning](https://github.com/PyTorchLightning/pytorch-lightning) `LightningModule` and are compatible with the entire PyTorch ecosystem. Users can therefore either use the higher-level APIs provided by PyTorch Lightning (via `Trainer`) or write their own training and evaluation loops in PyTorch directly (by calling the model and its individual components).

For NeMo developers, a "Model" is the neural network(s) as well as all the infrastructure supporting those network(s), wrapped into a singular, cohesive unit. As such, all NeMo models contain at least the following out of the box (some models support additional functionality too):

 -  Neural Network architecture - all of the modules that are required for the model.

 -  Dataset + Data Loaders - all of the components that prepare the data for consumption during training or evaluation.

 -  Preprocessing + Postprocessing - all of the components that process the datasets so they can easily be consumed by the modules.

 -  Optimizer + Schedulers - basic defaults that work out of the box, and allow further experimentation with ease.

 - Any other supporting infrastructure - tokenizers, language model configuration, data augmentation etc.

In [5]:
import nemo
nemo.__version__

'1.6.2'

# NeMo Collections
NeMo is subdivided into a few fundamental collections based on their domains: `asr`, `nlp`, and `tts`. The `import nemo` statement above did not import any of these collections. You might not need all of them at once, so NeMo allows partial imports of one or more collections as and when you require them.

In [6]:
import nemo.collections.asr as nemo_asr
import nemo.collections.nlp as nemo_nlp
import nemo.collections.tts as nemo_tts

In [3]:
asr_models = [model for model in dir(nemo_asr.models) if model.endswith("Model")]
asr_models

['ASRModel',
 'EncDecCTCModel',
 'EncDecClassificationModel',
 'EncDecRNNTBPEModel',
 'EncDecRNNTModel',
 'EncDecSpeakerLabelModel']

In [4]:
nemo_asr.models.EncDecCTCModel.list_available_models()

[PretrainedModelInfo(
 	pretrained_model_name=QuartzNet15x5Base-En,
 	description=QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean, and a WER of 10.05% on dev-other. Please visit https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels for further details.,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemospeechmodels/versions/1.0.0a5/files/QuartzNet15x5Base-En.nemo
 ),
 PretrainedModelInfo(
 	pretrained_model_name=stt_en_quartznet15x5,
 	description=For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_quartznet15x5,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_quartznet15x5/versions/1.0.0rc1/files/stt_en_quartznet15x5.nemo
 ),
 PretrainedModelInfo(
 	pre

In [6]:
nlp_models = [model for model in dir(nemo_nlp.models) if model.endswith("Model")]
nlp_models

['BERTLMModel',
 'BertDPRModel',
 'BertJointIRModel',
 'DuplexDecoderModel',
 'DuplexTaggerModel',
 'DuplexTextNormalizationModel',
 'EntityLinkingModel',
 'GLUEModel',
 'IntentSlotClassificationModel',
 'MTEncDecModel',
 'PunctuationCapitalizationModel',
 'QAModel',
 'SGDQAModel',
 'Text2SparqlModel',
 'TextClassificationModel',
 'TokenClassificationModel',
 'TransformerLMModel',
 'ZeroShotIntentModel']

In [7]:
tts_models = [model for model in dir(nemo_tts.models) if model.endswith("Model")]
tts_models

['AlignerModel',
 'DegliModel',
 'EDMel2SpecModel',
 'FastPitchHifiGanE2EModel',
 'FastPitchModel',
 'FastSpeech2HifiGanE2EModel',
 'FastSpeech2Model',
 'GlowTTSModel',
 'GriffinLimModel',
 'HifiGanModel',
 'MelGanModel',
 'MelPsuedoInverseModel',
 'SqueezeWaveModel',
 'Tacotron2Model',
 'TalkNetDursModel',
 'TalkNetPitchModel',
 'TalkNetSpectModel',
 'TwoStagesModel',
 'UniGlowModel',
 'WaveGlowModel']

In [18]:
QuartzNet = nemo_asr.models.EncDecCTCModel.from_pretrained('QuartzNet15x5Base-En')

[NeMo I 2022-02-11 06:43:27 cloud:56] Found existing object /root/.cache/torch/NeMo/NeMo_1.5.0/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo.
[NeMo I 2022-02-11 06:43:27 cloud:62] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.5.0/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo
[NeMo I 2022-02-11 06:43:27 common:728] Instantiating model from pre-trained checkpoint
[NeMo I 2022-02-11 06:43:28 features:265] PADDING: 16
[NeMo I 2022-02-11 06:43:28 features:282] STFT using torch
[NeMo I 2022-02-11 06:43:29 save_restore_connector:149] Model EncDecCTCModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.5.0/QuartzNet15x5Base-En/2b066be39e9294d7100fb176ec817722/QuartzNet15x5Base-En.nemo.


In [19]:
QuartzNet.summarize()



  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConvASREncoder                    | 18.9 M
2 | decoder           | ConvASRDecoder                    | 29.7 K
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | _wer              | WER                               | 0     
------------------------------------------------------------------------
18.9 M    Trainable params
0         Non-trainable params
18.9 M    Total params
75.698    Total estimated model params size (MB)

# Neural Module
Wait, what's `NeuralModule`? Where is the wonderful `torch.nn.Module`? 

`NeuralModule` is a subclass of `torch.nn.Module`, and it brings with it a few additional functionalities.

In addition to being a `torch.nn.Module`, thereby being entirely compatible with the PyTorch ecosystem, it has the following capabilities - 

1) `Typing` - It adds support for `Neural Type Checking` to the model. `Typing` is optional but quite useful, as we will discuss below!

2) `Serialization` - Remember the `OmegaConf` config dict and YAML config files? Well, all `NeuralModules` inherently support serialization/deserialization from such config dictionaries!

3) `FileIO` - This is another entirely optional file serialization system. Does your `NeuralModule` require some way to preserve data that can't be saved into a PyTorch checkpoint? Write your serialization and deserialization logic in two handy methods! **Note**: When you create the final NeMo Model, this will be implemented for you! Automatic serialization and deserialization support of NeMo models!

In [2]:
from nemo.core import NeuralModule

class MyEmptyModule(NeuralModule):

  def forward(self):
    print("Neural Module ~ hello world!")

NOTE! Installing ujson may make loading annotations faster.


In [3]:
x = MyEmptyModule()
x()

Neural Module ~ hello world!


# Neural types

Neural Types perform semantic checks on the inputs and outputs of modules and models. They contain information about:

- Semantics of what is stored in the tensors. For example, logits, logprobs, audiosignal, embeddings, etc.

- Axes layout, semantics, and (optionally) dimensionality. For example: [Batch, Time, Channel]

In [1]:
from nemo.core.neural_types import NeuralType
from nemo.core.neural_types import *
from nemo.core import NeuralModule
from nemo.core import typecheck
import torch
import nemo



In [24]:
# Case 1:
embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=30)
x = torch.randint(high=10, size=(1, 5))
print("x :", x)
print("embedding(x) :", embedding(x).shape)

x : tensor([[3, 5, 6, 9, 8]])
embedding(x) : torch.Size([1, 5, 30])


In [21]:
# Case 2
lstm = torch.nn.LSTM(1, 30, batch_first=True)
x = torch.randn(1, 5, 1)
print("x :", x)
print("lstm(x) :", lstm(x)[0].shape)  # Let's take all timestep outputs of the LSTM

x : tensor([[[ 1.0551],
         [ 0.7760],
         [-0.4139],
         [-0.6781],
         [ 0.9876]]])
lstm(x) : torch.Size([1, 5, 30])


As you can see, the output of Case 1 is an embedding of shape [1, 5, 30], and the output of Case 2 is an LSTM output (state h over all time steps), also of the same shape [1, 5, 30].

Do they have the same shape? **Yes**.

If we do a Case 1 .shape == Case 2 .shape, will we get True as an output? **Yes**.

Do they represent the same concept? **No**.

The ability to recognize that the two tensors do not represent the same semantic information is precisely why we utilize Neural Types. They contain information about both the shape and the semantic concept of what a tensor represents. If we performed a neural type check between the outputs of those two modules, it would raise an error saying that they are semantically different things.
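The idea can be sketched in plain Python, independent of NeMo. The `TaggedArray` class and the `check_same_semantics` helper are hypothetical names used only for illustration; they are not NeMo APIs. Both values share a shape, but a check that also looks at the semantic tag tells them apart.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TaggedArray:
    """A stand-in for a tensor: a shape plus a semantic tag (hypothetical)."""
    shape: Tuple[int, ...]
    semantics: str  # e.g. "embedding" or "lstm_output"


def check_same_semantics(a: TaggedArray, b: TaggedArray) -> None:
    """Raise TypeError if the two values do not represent the same concept."""
    if a.semantics != b.semantics:
        raise TypeError(f"semantic mismatch: {a.semantics} vs {b.semantics}")


emb_out = TaggedArray(shape=(1, 5, 30), semantics="embedding")
lstm_out = TaggedArray(shape=(1, 5, 30), semantics="lstm_output")

# A shape-only comparison cannot tell the two apart.
shapes_match = emb_out.shape == lstm_out.shape

# A comparison that includes the semantic tag can.
try:
    check_same_semantics(emb_out, lstm_out)
    semantic_error = None
except TypeError as err:
    semantic_error = str(err)

print(shapes_match)
print(semantic_error)
```

Neural Types carry richer information than a single string (axes and an element type), but the principle is the same: the check fails even though the shapes agree.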

![neuraltype_moti](./resources/images/neuraltype_motiviaton.png)

# Neural types - Usages

In [8]:
class EmbeddingModule(NeuralModule):
  def __init__(self):
    super().__init__()
    self.embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=30)

  @typecheck()
  def forward(self, x):
    return self.embedding(x)

  @property
  def input_types(self):
    return {
        'x': NeuralType(axes=('B', 'T'), elements_type=Index())
    }

  @property
  def output_types(self):
    return {
        'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EmbeddedTextType())
    }

In [11]:
embedding_module = EmbeddingModule()

In [9]:
class LSTMModule(NeuralModule):
  def __init__(self):
    super().__init__()
    self.lstm = torch.nn.LSTM(1, 30, batch_first=True)

  @typecheck()
  def forward(self, x):
    return self.lstm(x)

  @property
  def input_types(self):
    return {
        'x': NeuralType(axes=('B', 'T', 'C'), elements_type=SpectrogramType())
    }

  @property
  def output_types(self):
    return {
        'y': NeuralType(axes=('B', 'T', 'C'), elements_type=EncodedRepresentation()),
        'h_c': [NeuralType(axes=('D', 'B', 'C'), elements_type=EncodedRepresentation())],
    }

In [10]:
lstm_module = LSTMModule()

`@typecheck()` is a simple decorator that works with any class that inherits `Typing` (NeuralModule does this for us). `Typing` provides the two default properties `input_types` and `output_types`, which return `None` by default.

The `@typecheck()` decorator's explicit use ensures that, by default, neural type checking is **disabled**. NeMo does not wish to intrude on the development process of models. So users can "opt-in" to type checking by overriding the two properties. Therefore, the decorator ensures that users are not burdened with type checking before they wish to have it.

So what is `@typecheck()`? Simply put, you can wrap **any** function of a class that inherits `Typing` with this decorator, and it will look up the definition of the types of that class and enforce them. Typically, `torch.nn.Module` subclasses only implement `forward()` so it is most common to wrap that method, but `@typecheck()` is a very flexible decorator.

As we see above, `@typecheck()` enforces the types. How then, do we provide this type of information to NeMo? 

By overriding `input_types` and `output_types` properties of the class, we can return a dictionary mapping a string name to a `NeuralType`.
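The opt-in behaviour described above can be sketched in plain Python. This is a simplified illustration, not NeMo's implementation: `simple_typecheck`, `Base`, `Untyped`, and `Typed` are hypothetical names, and the sketch checks Python classes with `isinstance` where NeMo compares `NeuralType` objects. It only mirrors the ideas from the text: a default of `None` disables checking, overriding the property enables it, and typed calls use keyword arguments.

```python
import functools


def simple_typecheck(method):
    """Hypothetical, simplified stand-in for a type-checking decorator."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        declared = self.input_types
        if declared is None:
            # Nothing declared: the check is effectively disabled.
            return method(self, *args, **kwargs)
        if args:
            raise TypeError("only keyword arguments are allowed once types are declared")
        for name, value in kwargs.items():
            expected = declared.get(name)
            if expected is None:
                raise TypeError(f"unexpected input: {name}")
            if not isinstance(value, expected):
                raise TypeError(f"input {name} should be {expected.__name__}")
        return method(self, **kwargs)
    return wrapper


class Base:
    @property
    def input_types(self):
        return None  # default: nothing declared, nothing enforced


class Untyped(Base):
    @simple_typecheck
    def forward(self, x):
        return x * 2


class Typed(Base):
    @property
    def input_types(self):
        return {"x": int}  # opting in by overriding the property

    @simple_typecheck
    def forward(self, x):
        return x * 2


untyped_result = Untyped().forward("ab")  # no declared types, so no check runs
typed_result = Typed().forward(x=3)       # passes the declared check
try:
    Typed().forward(x="ab")               # violates the declared type
    error = None
except TypeError as err:
    error = str(err)

print(untyped_result, typed_result, error)
```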

In the above case, we define a `NeuralType` as two components - 

- `axes`: This is the semantic information carried by the axes themselves. The most common axes are given in single-character notation.

> `B` - Batch <br>
> `C` / `D` - Channel / Dimension (treated the same) <br>
> `T` - Time <br>
> `H` / `W` - Height / Width <br>

- `elements_type`: This is the semantic information of "what the tensor represents". All such types are derived from the basic `ElementType`, and merely subclassing `ElementType` allows us to build a hierarchy of custom semantic types that can be used by NeMo!

Here, we declare that the input has an `elements_type` of `Index` (index of the character in the vocabulary) and that the output has an `elements_type` of `EmbeddedTextType` (the text embedding).

In [13]:
# Case 1
x1 = torch.randint(high=10, size=(1, 5))
print("x :", x1)
print("embedding(x) :", embedding_module(x=x1).shape)

x : tensor([[5, 0, 9, 6, 4]])
embedding(x) : torch.Size([1, 5, 30])


In [14]:
# Case 2
x2 = torch.randn(1, 5, 1)
y2, (h, c) = lstm_module(x=x2)
print("x :", x2)
print("lstm(x) :", y2.shape)  # The output of the LSTM RNN
print("hidden state (h) :", h.shape)  # The final hidden state (h) of the LSTM
print("hidden state (c) :", c.shape)  # The final cell state (c) of the LSTM

x : tensor([[[0.1985],
         [0.4477],
         [0.2269],
         [0.8882],
         [0.2836]]])
lstm(x) : torch.Size([1, 5, 30])
hidden state (h) : torch.Size([1, 1, 30])
hidden state (c) : torch.Size([1, 1, 30])


In [16]:
emb_out = embedding_module(x=x1)
lstm_out = lstm_module(x=x2)[0]

assert hasattr(emb_out, 'neural_type')
assert hasattr(lstm_out, 'neural_type')

print("Embedding tensor :", emb_out.neural_type)
print("LSTM tensor :", lstm_out.neural_type)

Embedding tensor : axes: (batch, time, dimension); elements_type: EmbeddedTextType
LSTM tensor : axes: (batch, time, dimension); elements_type: EncodedRepresentation


# Constructing a NeMo Model
https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb

There is a great post giving a gentle introduction about PyTorch Lightning: [click here](https://towardsdatascience.com/from-pytorch-to-pytorch-lightning-a-gentle-introduction-b371b7caaf09).

According to this post, `PyTorch` is extremely easy to use to build complex AI models. But once the research gets complicated and things like multi-GPU training, 16-bit precision and TPU training get mixed in, users are likely to introduce bugs.

`PyTorch Lightning` solves exactly this problem. Lightning structures your PyTorch code so it can abstract the details of training. This makes AI research scalable and fast to iterate on.

In [55]:
# import PyTorch modules

import math  # needed for math.sqrt in CausalSelfAttention below
from typing import List, Set, Dict, Tuple, Optional
import torch
import torch.nn as nn
from torch.nn import functional as F

Below are the Transformer modules from minGPT (https://github.com/karpathy/minGPT).

In [59]:
class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    It is possible to use torch.nn.MultiheadAttention here but I am including an
    explicit implementation here to show that there is nothing too scary here.
    """

    def __init__(self, n_embd, block_size, n_head, attn_pdrop, resid_pdrop):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        # key, query, value projections for all heads
        self.key = nn.Linear(n_embd, n_embd)
        self.query = nn.Linear(n_embd, n_embd)
        self.value = nn.Linear(n_embd, n_embd)
        # regularization
        self.attn_drop = nn.Dropout(attn_pdrop)
        self.resid_drop = nn.Dropout(resid_pdrop)
        # output projection
        self.proj = nn.Linear(n_embd, n_embd)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size))
                                     .view(1, 1, block_size, block_size))

    def forward(self, x, layer_past=None):
        B, T, C = x.size()

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        k = self.key(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = self.query(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = self.value(x).view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_drop(att)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.resid_drop(self.proj(y))
        return y
    

class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, n_embd, block_size, n_head, attn_pdrop, resid_pdrop):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, block_size, n_head, attn_pdrop, resid_pdrop)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(resid_pdrop),
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
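The effect of the causal mask in `CausalSelfAttention.forward` can be illustrated without PyTorch. The sketch below is a plain-Python, single-head version with a hypothetical helper name (`causal_softmax`) and made-up scores: future positions are set to `-inf` before the softmax, so they receive exactly zero attention weight.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    out = []
    for i, row in enumerate(scores):
        # Mask future positions with -inf, mirroring the masked_fill call.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        peak = max(masked)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# Made-up attention scores for a sequence of three positions.
scores = [
    [0.5, 2.0, -1.0],
    [1.0, 1.0, 3.0],
    [0.0, 0.0, 0.0],
]
att = causal_softmax(scores)
for row in att:
    print([round(p, 3) for p in row])
```

The first position can only attend to itself, the second splits its weight over the first two positions, and only the last position sees the whole sequence.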

In [52]:
# import other modules for porting

import nemo
import pytorch_lightning as ptl
from nemo.core import NeuralModule
from nemo.core import ModelPT
from omegaconf import OmegaConf

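- `NeuralModule` is a subclass of `torch.nn.Module`, and it brings with it a few additional functionalities.
- `ModelPT` is the NeMo equivalent of `LightningModule`.
- `OmegaConf` (from the `omegaconf` package) provides the config objects that NeMo models are constructed from.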

In [53]:
class AttentionType(EncodedRepresentation):
  """Basic Attention Element Type"""

class SelfAttentionType(AttentionType):
  """Self Attention Element Type"""

class CausalSelfAttentionType(SelfAttentionType):
  """Causal Self Attention Element Type"""

In [56]:
# from PyTorch to PyTorch Lightning

class PTLGPT(ptl.LightningModule):
  def __init__(self,
                 # model definition args
                 vocab_size: int, # size of the vocabulary (number of possible tokens)
                 block_size: int, # length of the model's context window in time
                 n_layer: int, # depth of the model; number of Transformer blocks in sequence
                 n_embd: int, # the "width" of the model, number of channels in each Transformer
                 n_head: int, # number of heads in each multi-head attention inside each Transformer block
                 # model optimization args
                 learning_rate: float = 3e-4, # the base learning rate of the model
                 weight_decay: float = 0.1, # amount of regularizing L2 weight decay on MatMul ops
                 betas: Tuple[float, float] = (0.9, 0.95), # momentum terms (betas) for the Adam optimizer
                 embd_pdrop: float = 0.1, # \in [0,1]: amount of dropout on input embeddings
                 resid_pdrop: float = 0.1, # \in [0,1]: amount of dropout in each residual connection
                 attn_pdrop: float = 0.1, # \in [0,1]: amount of dropout on the attention matrix
                 ):
        super().__init__()

        # save these for optimizer init later
        self.learning_rate = learning_rate
        self.weight_decay = weight_decay
        self.betas = betas

        # input embedding stem: drop(content + position)
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))
        self.drop = nn.Dropout(embd_pdrop)
        # deep transformer: just a sequence of transformer blocks
        self.blocks = nn.Sequential(*[Block(n_embd, block_size, n_head, attn_pdrop, resid_pdrop) for _ in range(n_layer)])
        # decoder: at the end one more layernorm and decode the answers
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False) # no need for extra bias due to one in ln_f

        self.block_size = block_size
        self.apply(self._init_weights)

        print("number of parameters: %e" % sum(p.numel() for p in self.parameters()))

  def forward(self, idx):
      b, t = idx.size()
      assert t <= self.block_size, "Cannot forward, model block size is exhausted."

      # forward the GPT model
      token_embeddings = self.tok_emb(idx) # each index maps to a (learnable) vector
      position_embeddings = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector
      x = self.drop(token_embeddings + position_embeddings)
      x = self.blocks(x)
      x = self.ln_f(x)
      logits = self.head(x)

      return logits

  def get_block_size(self):
      return self.block_size

  def _init_weights(self, module):
      """
      Vanilla model initialization:
      - all MatMul weights \in N(0, 0.02) and biases to zero
      - all LayerNorm post-normalization scaling set to identity, so weight=1, bias=0
      """
      if isinstance(module, (nn.Linear, nn.Embedding)):
          module.weight.data.normal_(mean=0.0, std=0.02)
          if isinstance(module, nn.Linear) and module.bias is not None:
              module.bias.data.zero_()
      elif isinstance(module, nn.LayerNorm):
          module.bias.data.zero_()
          module.weight.data.fill_(1.0)

In [58]:
m = PTLGPT(vocab_size=100, block_size=32, n_layer=1, n_embd=32, n_head=4)

number of parameters: 2.019200e+04


A NeMo Model constructor generally accepts only two things - 

1) `cfg`: An OmegaConf DictConfig object that defines precisely the components required by the model to define its neural network architecture, data loader setup, optimizer setup, and any additional components needed for the model itself.

2) `trainer`: An optional Trainer from PyTorch Lightning if the NeMo model will be used for training. It can be set after construction (if required) using the `set_trainer` method. For this notebook, we will not be constructing the config for the Trainer object.

### Refactoring the Embedding module

In [60]:
class GPTEmbedding(NeuralModule):
  def __init__(self, vocab_size: int, n_embd: int, block_size: int, embd_pdrop: float = 0.0):
    super().__init__()

    # input embedding stem: drop(content + position)
    self.tok_emb = nn.Embedding(vocab_size, n_embd)
    self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))
    self.drop = nn.Dropout(embd_pdrop)

  @typecheck()
  def forward(self, idx):
    b, t = idx.size()
    
    # forward the GPT model
    token_embeddings = self.tok_emb(idx) # each index maps to a (learnable) vector
    position_embeddings = self.pos_emb[:, :t, :] # each position maps to a (learnable) vector
    x = self.drop(token_embeddings + position_embeddings)
    return x

  @property
  def input_types(self):
    return {
        'idx': NeuralType(('B', 'T'), Index())
    }

  @property
  def output_types(self):
    return {
        'embeddings': NeuralType(('B', 'T', 'C'), EmbeddedTextType())
    }

### Refactoring the Encoder

In [62]:
class GPTTransformerEncoder(NeuralModule):
  def __init__(self, n_embd: int, block_size: int, n_head: int, n_layer: int, attn_pdrop: float = 0.0, resid_pdrop: float = 0.0):
    super().__init__()

    self.blocks = nn.Sequential(*[Block(n_embd, block_size, n_head, attn_pdrop, resid_pdrop) 
                                  for _ in range(n_layer)])
    
  @typecheck()
  def forward(self, embed):
    return self.blocks(embed)

  @property
  def input_types(self):
    return {
        'embed': NeuralType(('B', 'T', 'C'), EmbeddedTextType())
    }

  @property
  def output_types(self):
    return {
        'encoding': NeuralType(('B', 'T', 'C'), CausalSelfAttentionType())
    }

### Refactoring the Decoder

In [63]:
class GPTDecoder(NeuralModule):
  def __init__(self, n_embd: int, vocab_size: int):
    super().__init__()
    self.ln_f = nn.LayerNorm(n_embd)
    self.head = nn.Linear(n_embd, vocab_size, bias=False) # no need for extra bias due to one in ln_f

  @typecheck()
  def forward(self, encoding):
    x = self.ln_f(encoding)
    logits = self.head(x)
    return logits

  @property
  def input_types(self):
    return {
        'encoding': NeuralType(('B', 'T', 'C'), EncodedRepresentation())
    }
  
  @property
  def output_types(self):
    return {
        'logits': NeuralType(('B', 'T', 'C'), LogitsType())
    }
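Note that `GPTTransformerEncoder` emits `CausalSelfAttentionType`, while `GPTDecoder` declares the more general `EncodedRepresentation` as its input. The sketch below uses stand-in classes in plain Python (not NeMo's own classes) to illustrate the subclass relationship such a check can rely on. The `is_compatible` helper is hypothetical, and NeMo's actual comparison logic may differ in detail.

```python
class ElementType:
    """Stand-in root of the semantic type hierarchy (not NeMo's class)."""


class EncodedRepresentation(ElementType):
    pass


class AttentionType(EncodedRepresentation):
    pass


class SelfAttentionType(AttentionType):
    pass


class CausalSelfAttentionType(SelfAttentionType):
    pass


class LogitsType(ElementType):
    pass


def is_compatible(expected: type, actual: type) -> bool:
    """An input slot accepts its declared type or any more specific subtype."""
    return issubclass(actual, expected)


# A causal self-attention output is also an encoded representation.
encoder_to_decoder = is_compatible(EncodedRepresentation, CausalSelfAttentionType)
# Logits are unrelated to encoded representations.
logits_to_decoder = is_compatible(EncodedRepresentation, LogitsType)
# A general type does not satisfy a slot that asks for a specific subtype.
general_to_specific = is_compatible(CausalSelfAttentionType, EncodedRepresentation)

print(encoder_to_decoder, logits_to_decoder, general_to_specific)
```

Declaring the most general type a module can accept keeps it reusable: under this scheme, any encoder whose output type derives from `EncodedRepresentation` could feed the decoder.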

### Refactoring the NeMo GPT Model

In [65]:
class AbstractNeMoGPT(ModelPT):
  def __init__(self, cfg: OmegaConf, trainer: ptl.Trainer = None):
      super().__init__(cfg=cfg, trainer=trainer)

      # input embedding stem: drop(content + position)
      self.embedding = self.from_config_dict(self.cfg.embedding)
      # deep transformer: just a sequence of transformer blocks
      self.encoder = self.from_config_dict(self.cfg.encoder)
      # decoder: at the end one more layernorm and decode the answers
      self.decoder = self.from_config_dict(self.cfg.decoder)

      self.block_size = self.cfg.embedding.block_size
      self.apply(self._init_weights)

      print("number of parameters: %e" % self.num_weights)

  @typecheck()
  def forward(self, idx):
      b, t = idx.size()
      assert t <= self.block_size, "Cannot forward, model block size is exhausted."

      # forward the GPT model
      # Remember: Only kwargs are allowed !
      e = self.embedding(idx=idx)
      x = self.encoder(embed=e)
      logits = self.decoder(encoding=x)

      return logits

  def get_block_size(self):
      return self.block_size

  def _init_weights(self, module):
      """
      Vanilla model initialization:
      - all MatMul weights \in N(0, 0.02) and biases to zero
      - all LayerNorm post-normalization scaling set to identity, so weight=1, bias=0
      """
      if isinstance(module, (nn.Linear, nn.Embedding)):
          module.weight.data.normal_(mean=0.0, std=0.02)
          if isinstance(module, nn.Linear) and module.bias is not None:
              module.bias.data.zero_()
      elif isinstance(module, nn.LayerNorm):
          module.bias.data.zero_()
          module.weight.data.fill_(1.0)

  @property
  def input_types(self):
    return {
        'idx': NeuralType(('B', 'T'), Index())
    }

  @property
  def output_types(self):
    return {
        'logits': NeuralType(('B', 'T', 'C'), LogitsType())
    }

# Creating a config for a Model
At first glance, not much changed compared to the PyTorch Lightning implementation above. Other than the constructor, which now accepts a config, nothing changed at all.

NeMo operates on the concept of a NeMo Model being accompanied by a corresponding config dict (instantiated as an OmegaConf object). This enables us to rapidly prototype the model using Hydra. It also brings other benefits, such as hyperparameter optimization and serialization/deserialization of NeMo models.

Let's look at how to actually construct such config objects.

In [67]:
# model definition args (required)
# ================================
# vocab_size: int # size of the vocabulary (number of possible tokens)
# block_size: int # length of the model's context window in time
# n_layer: int # depth of the model; number of Transformer blocks in sequence
# n_embd: int # the "width" of the model, number of channels in each Transformer
# n_head: int # number of heads in each multi-head attention inside each Transformer block  

# model definition args (optional)
# ================================
# embd_pdrop: float = 0.1, # \in [0,1]: amount of dropout on input embeddings
# resid_pdrop: float = 0.1, # \in [0,1]: amount of dropout in each residual connection
# attn_pdrop: float = 0.1, # \in [0,1]: amount of dropout on the attention matrix

In [68]:
from omegaconf import MISSING

In [71]:
# Let's create a utility for building the class path
def get_class_path(cls):
  return f'{cls.__module__}.{cls.__name__}'

In [69]:
common_config = OmegaConf.create({
    'vocab_size': MISSING,
    'block_size': MISSING,
    'n_layer': MISSING,
    'n_embd': MISSING,
    'n_head': MISSING,
})

In [72]:
embedding_config = OmegaConf.create({
    '_target_': get_class_path(GPTEmbedding),
    'vocab_size': '${model.vocab_size}',
    'n_embd': '${model.n_embd}',
    'block_size': '${model.block_size}',
    'embd_pdrop': 0.1
})

encoder_config = OmegaConf.create({
    '_target_': get_class_path(GPTTransformerEncoder),
    'n_embd': '${model.n_embd}',
    'block_size': '${model.block_size}',
    'n_head': '${model.n_head}',
    'n_layer': '${model.n_layer}',
    'attn_pdrop': 0.1,
    'resid_pdrop': 0.1
})

decoder_config = OmegaConf.create({
    '_target_': get_class_path(GPTDecoder),
    # n_embd: int, vocab_size: int
    'n_embd': '${model.n_embd}',
    'vocab_size': '${model.vocab_size}'
})

`_target_` is usually a full classpath to the actual class in the python package/user local directory. It is required for Hydra to locate and instantiate the model from its path correctly.
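To make the role of `_target_` concrete, here is a minimal sketch of how a class path can be turned back into an object. The `instantiate_from_target` helper is hypothetical and far simpler than what Hydra or NeMo actually do; it uses the standard-library `fractions.Fraction` class as a stand-in target so that it runs without NeMo.

```python
import importlib
from fractions import Fraction


def get_class_path(cls) -> str:
    """Build the dotted path used as `_target_`."""
    return f"{cls.__module__}.{cls.__name__}"


def instantiate_from_target(config: dict):
    """Hypothetical helper: import the class named by `_target_` and call it
    with the remaining keys as keyword arguments."""
    kwargs = dict(config)  # copy so the caller's config is left untouched
    module_name, _, class_name = kwargs.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**kwargs)


config = {
    "_target_": get_class_path(Fraction),
    "numerator": 3,
    "denominator": 4,
}
obj = instantiate_from_target(config)

print(config["_target_"])
print(obj)
```

Classes defined inside a notebook live in the `__main__` module, which is why the class paths printed for the GPT modules in this notebook start with `__main__.`.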

In general, when developing models, we don't often change the encoder or the decoder, but we do change the hyperparameters of the encoder and decoder.

This notation helps us keep the Model-level declaration of the forward step neat and precise. It also helps us logically demarcate which parts of the model can be easily replaced: in the future, we can easily replace the encoder with some other type of self-attention block or the decoder with an RNN or 1D-CNN neural module (as long as they have the same Neural Type definition as the current blocks).

In [73]:
model_config = OmegaConf.create({
    'model': common_config
})

# Then let's attach the sub-module configs
model_config.model.embedding = embedding_config
model_config.model.encoder = encoder_config
model_config.model.decoder = decoder_config

In [74]:
print(OmegaConf.to_yaml(model_config))

model:
  vocab_size: ???
  block_size: ???
  n_layer: ???
  n_embd: ???
  n_head: ???
  embedding:
    _target_: __main__.GPTEmbedding
    vocab_size: ${model.vocab_size}
    n_embd: ${model.n_embd}
    block_size: ${model.block_size}
    embd_pdrop: 0.1
  encoder:
    _target_: __main__.GPTTransformerEncoder
    n_embd: ${model.n_embd}
    block_size: ${model.block_size}
    n_head: ${model.n_head}
    n_layer: ${model.n_layer}
    attn_pdrop: 0.1
    resid_pdrop: 0.1
  decoder:
    _target_: __main__.GPTDecoder
    n_embd: ${model.n_embd}
    vocab_size: ${model.vocab_size}



In [80]:
import copy

# Let's work on a copy of the model config and update it before we send it into the Model.
cfg = copy.deepcopy(model_config)
# Let's set the values of the config (for some plausible small model)
cfg.model.vocab_size = 100
cfg.model.block_size = 128
cfg.model.n_layer = 1
cfg.model.n_embd = 32
cfg.model.n_head = 4
print(OmegaConf.to_yaml(cfg))

model:
  vocab_size: 100
  block_size: 128
  n_layer: 1
  n_embd: 32
  n_head: 4
  embedding:
    _target_: __main__.GPTEmbedding
    vocab_size: ${model.vocab_size}
    n_embd: ${model.n_embd}
    block_size: ${model.block_size}
    embd_pdrop: 0.1
  encoder:
    _target_: __main__.GPTTransformerEncoder
    n_embd: ${model.n_embd}
    block_size: ${model.block_size}
    n_head: ${model.n_head}
    n_layer: ${model.n_layer}
    attn_pdrop: 0.1
    resid_pdrop: 0.1
  decoder:
    _target_: __main__.GPTDecoder
    n_embd: ${model.n_embd}
    vocab_size: ${model.vocab_size}
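The printed YAML still shows the `${model...}` references rather than the values they point to. The following plain-Python sketch illustrates what resolving such references means. The `lookup` and `resolve` helpers are hypothetical and handle only whole-string references on nested dicts; OmegaConf itself supports more syntax and performs resolution for you.

```python
import re

MISSING = "???"  # the marker printed above for unset mandatory values
_PATTERN = re.compile(r"\$\{([^}]+)\}")


def lookup(root: dict, dotted_key: str):
    """Follow a dotted path such as 'model.n_embd' from the config root."""
    node = root
    for part in dotted_key.split("."):
        node = node[part]
    if node == MISSING:
        raise ValueError(f"missing mandatory value: {dotted_key}")
    return node


def resolve(node, root: dict):
    """Return a copy of `node` with every '${...}' reference replaced."""
    if isinstance(node, dict):
        return {key: resolve(value, root) for key, value in node.items()}
    if isinstance(node, str):
        match = _PATTERN.fullmatch(node)
        if match:
            return resolve(lookup(root, match.group(1)), root)
    return node


cfg = {
    "model": {
        "vocab_size": 100,
        "n_embd": 32,
        "n_head": MISSING,  # deliberately left unset
        "decoder": {"n_embd": "${model.n_embd}", "vocab_size": "${model.vocab_size}"},
        "encoder": {"n_head": "${model.n_head}"},
    }
}

decoder = resolve(cfg["model"]["decoder"], cfg)
print(decoder)
```

Because the sub-module configs only reference the shared values, changing `model.n_embd` in one place changes it for every module that refers to it.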



In [81]:
m = AbstractNeMoGPT(cfg.model)

TypeError: Can't instantiate abstract class AbstractNeMoGPT with abstract methods list_available_models, setup_training_data, setup_validation_data

You will note that we added the `Abstract` tag for a reason to this NeMo Model and that when we try to instantiate it - it raises an error that we need to implement specific methods.
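The mechanism behind this error is Python's abstract base classes. A minimal sketch with hypothetical class names (`AbstractModel`, `ConcreteModel`), independent of NeMo: a class with unimplemented abstract methods cannot be instantiated, while a subclass that implements them can.

```python
from abc import ABC, abstractmethod


class AbstractModel(ABC):
    """Hypothetical base class that requires subclasses to set up data."""

    @abstractmethod
    def setup_training_data(self, config):
        ...

    @abstractmethod
    def setup_validation_data(self, config):
        ...


class ConcreteModel(AbstractModel):
    def setup_training_data(self, config):
        self.train_dl = None  # placeholder, no real data loader

    def setup_validation_data(self, config):
        self.val_dl = None  # placeholder, no real data loader


try:
    AbstractModel()
    error = None
except TypeError as err:
    error = str(err)

model = ConcreteModel()
print(error)
print(isinstance(model, AbstractModel))
```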

1) `setup_training_data` & `setup_validation_data` - All NeMo models should implement two data loaders - the training data loader and the validation data loader. Optionally, they can go one step further and also implement the `setup_test_data` method to add support for evaluating the Model on its own.

Why do we enforce this? NeMo Models are meant to be a unified, cohesive object containing the details about the neural network underlying that Model and the data loaders to train, validate, and optionally test those models.

In doing so, once the Model is created/deserialized, it takes just a few more steps to train the Model from scratch, fine-tune it, or evaluate it on any data that the user provides, as long as this user-provided dataset is in a format supported by the Dataset / DataLoader used by this Model!

2) `list_available_models` - This is a utility method to provide a list of pre-trained NeMo models to the user from the cloud.

Typically, NeMo models can be easily packaged into a tar file (which we call a .nemo file in the earlier primer notebook). These tar files contain the model config + the pre-trained checkpoint weights of the Model, and can easily be downloaded from some cloud service. 

For this notebook, we will not be implementing this method.
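As a rough illustration of the packaging idea, the sketch below bundles a config and a weights payload into a single tar archive using only the standard library. The file names and contents are placeholders chosen for this example, not the layout of a real `.nemo` file, and the `pack` / `unpack` helpers are hypothetical.

```python
import io
import tarfile


def pack(files: dict) -> bytes:
    """Pack {name: bytes} into an in-memory tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def unpack(archive: bytes) -> dict:
    """Read every member of a tar archive back into {name: bytes}."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar.getmembers()}


# Placeholder names and contents, for illustration only.
archive = pack({
    "config.yaml": b"model:\n  n_embd: 32\n",
    "weights.bin": b"\x00\x01\x02\x03",
})
restored = unpack(archive)
print(sorted(restored))
```

Keeping the config next to the weights means the archive alone is enough to rebuild the model structure before loading the parameters.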

In [83]:
from nemo.core.classes.common import PretrainedModelInfo

class BasicNeMoGPT(AbstractNeMoGPT):

  @classmethod
  def list_available_models(cls) -> PretrainedModelInfo:
    return None

  def setup_training_data(self, train_data_config: OmegaConf):
    self._train_dl = None
  
  def setup_validation_data(self, val_data_config: OmegaConf):
    self._validation_dl = None
  
  def setup_test_data(self, test_data_config: OmegaConf):
    self._test_dl = None


In [84]:
m = BasicNeMoGPT(cfg.model)

number of parameters: 2.326400e+04
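The reported count can be checked by hand. The sketch below adds up the parameters of each layer defined in this notebook; the helper names `linear` and `gpt_param_count` are introduced here for illustration. It also shows why this count differs from the earlier `PTLGPT` count: the only change is `block_size` (128 here versus 32 there), which enlarges the positional embedding. The causal mask is registered as a buffer, so it does not count as a parameter, and `n_head` does not affect the total.

```python
def linear(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def gpt_param_count(vocab_size: int, block_size: int, n_layer: int, n_embd: int) -> int:
    layer_norm = 2 * n_embd                 # weight + bias
    attention = 4 * linear(n_embd, n_embd)  # key, query, value, proj
    mlp = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
    block = 2 * layer_norm + attention + mlp
    embedding = vocab_size * n_embd + block_size * n_embd          # tok_emb + pos_emb
    decoder = layer_norm + linear(n_embd, vocab_size, bias=False)  # ln_f + head
    return embedding + n_layer * block + decoder


ptl_count = gpt_param_count(vocab_size=100, block_size=32, n_layer=1, n_embd=32)
nemo_count = gpt_param_count(vocab_size=100, block_size=128, n_layer=1, n_embd=32)

print(ptl_count)               # 20192, i.e. 2.019200e+04
print(nemo_count)              # 23264, i.e. 2.326400e+04
print(nemo_count - ptl_count)  # 3072 = (128 - 32) * 32 extra positional weights
```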


For further implementation, refer to https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb