# 5. Deep Learning Computation

### 5.1 Layers and Blocks

When we first started talking about neural nets, we introduced linear models with a single output. Here, the entire model consists of just a single neuron. By itself a single neuron takes some set of inputs, generates a corresponding (scalar) output, and has a set of associated parameters that can be updated to optimize some objective function of interest. Then, once we started thinking about networks with multiple outputs, we leveraged vectorized arithmetic and showed how we could use linear algebra to efficiently express an entire layer of neurons. Layers too expect some inputs, generate corresponding outputs, and are described by a set of tunable parameters.

When we worked through softmax regression, a single layer was itself the model. However, when we subsequently introduced multilayer perceptrons, we developed models consisting of multiple layers. One interesting property of multilayer neural networks is that the *entire model* and its *constituent layers* share the same basic structure. The model takes the true inputs (as stated in the problem formulation), outputs predictions of the true outputs, and possesses parameters (the combined set of all parameters from all layers). Likewise any individual constituent layer in a multilayer perceptron ingests inputs (supplied by the previous layer), generates outputs (which form the inputs to the subsequent layer), and possesses a set of tunable parameters that are updated with respect to the ultimate objective (using the signal that flows backwards through the subsequent layers).

While you might think that neurons, layers, and models give us enough abstractions to go about our business, it turns out that we will often want to express our model in terms of components that are larger than an individual layer. For example, when designing models like ResNet-152, which possess hundreds (152, thus the name) of layers, implementing the network one layer at a time can grow tedious. Moreover, this concern is not just hypothetical: such deep networks dominate numerous application areas, especially when training data is abundant.

To facilitate the implementation of networks consisting of components of arbitrary complexity, we introduce a new flexible concept: a neural network *block*. A block could describe a single neuron, a high-dimensional layer, or an arbitrarily complex component consisting of multiple layers. From a software development perspective, a `Block` is a class. Any subclass of `Block` must define a method called `forward` that transforms its input into output, and must store any necessary parameters. Note that some Blocks do not require parameters at all! Finally, a Block must possess a `backward` method, for purposes of calculating gradients. Fortunately, due to some behind-the-scenes magic supplied by the autograd package, defining our own Block typically requires only that we worry about parameters and the `forward` function.

To recap, let's revisit the Blocks that played a role in the implementation of the multilayer perceptron. The following code generates a network with one fully-connected hidden layer containing 256 units followed by a ReLU activation function, and then another fully-connected layer consisting of 10 units (with no activation function, since this is the output layer).

In [1]:
from mxnet import np, npx
from mxnet.gluon import nn

npx.set_np()

In [2]:
x = np.random.uniform(size = (2,20))

In [3]:
net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize()
net(x)

array([[ 0.06240272, -0.03268594,  0.02582652,  0.02254181, -0.03728798,
        -0.04253786,  0.00540613, -0.01364184, -0.09915452, -0.02272737],
       [ 0.02816678, -0.03341203,  0.03565667,  0.02506383, -0.04136416,
        -0.04941843,  0.01738529,  0.01081963, -0.09932576, -0.01176296]])

In this example, our model consists of an object returned by the `nn.Sequential` constructor. **After instantiating an `nn.Sequential` and storing it in the `net` variable, we repeatedly called its `add` method, appending layers in the order that they should be executed**.

**In short, `nn.Sequential` just defines a special kind of `Block`. Specifically, an `nn.Sequential` maintains a list of constituent Blocks, stored in a particular order. You might think of `nn.Sequential` as a type of meta-block. The `add` method simply facilitates the addition of each successive Block to the list**. Note that each of our layers is an instance of the `Dense` class, which is itself a subclass of `Block`. The `forward` function is also remarkably simple: it chains each Block in the list together, passing the output of each as the input to the next.

Before we dive in to implementing a custom block, let's briefly summarize the basic functionality that each `Block` must perform:

- 1) Ingest input data as arguments to its `forward` function.
- 2) Generate an output via the value returned by its `forward` function. Note that the output may have a different shape from the input. For example, the first Dense layer in our model above ingests an input of arbitrary dimension but returns an output of dimension 256.
- 3) **Calculate the gradient of its output with respect to its input, which can be accessed via its `backward` method. Typically this happens automatically.**
- 4) Store and provide access to those parameters necessary to execute the forward computation.
- 5) Initialize these parameters as needed.
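
The five responsibilities above can be pictured without any framework. The following is a minimal pure-Python sketch, not Gluon's implementation: the class `ScaleBlock`, its `params` dictionary, and the hand-written `backward` are hypothetical names chosen for illustration, whereas Gluon derives the backward pass through autograd.

```python
class ScaleBlock:
    """Toy block computing y = w * x elementwise (hypothetical, not Gluon)."""

    def __init__(self):
        # (4) Store the parameters needed by the forward computation
        self.params = {}

    def initialize(self, value=0.5):
        # (5) Initialize the parameters
        self.params['w'] = value

    def forward(self, x):
        # (1) Ingest the input and (2) return the output
        return [self.params['w'] * xi for xi in x]

    def backward(self, grad_out):
        # (3) Gradient of the output with respect to the input: dy/dx = w
        return [self.params['w'] * g for g in grad_out]

    def __call__(self, x):
        return self.forward(x)


block = ScaleBlock()
block.initialize(value=0.5)
y = block([2.0, 4.0])
grad_x = block.backward([1.0, 1.0])
print(y)       # [1.0, 2.0]
print(grad_x)  # [0.5, 0.5]
```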

#### 5.1.1 A Custom Block

Perhaps the easiest way to develop intuition about how `nn.Block` works is to just dive right in and implement one ourselves. In the following snippet, instead of relying on the `nn.Sequential`, we just code up a block from scratch that implements a multilayer perceptron with one hidden layer, 256 hidden nodes, and 10 outputs.

Our MLP class below inherits from the `Block` class. While we rely on some predefined methods in the parent class, we need to supply our own `__init__` and `forward` functions to uniquely define the behavior of our model.

In [5]:
from mxnet.gluon import nn

In [6]:
class MLP(nn.Block):
    # Declare a layer with model parameters. Here, we declare two fully
    # connected layers
    def __init__(self, **kwargs):
        # Call the constructor of the MLP parent class Block to perform the 
        # necessary initialization. In this way, other function parameters can
        # also be specified when constructing an instance, such as the model parameters,
        # params, described in the following sections
        super(MLP, self).__init__(**kwargs)
        self.hidden = nn.Dense(256, activation='relu')
        self.output = nn.Dense(10)
        
    # Define the forward computation of the model, that is, how to return the
    # required model output based on the input x
    def forward(self, x):
        return self.output(self.hidden(x))

This code may be easiest to understand by working backwards from `forward`. Note that the `forward` method takes `x` as input. It first evaluates `self.hidden(x)` to produce the hidden representation and then passes this output as the input to the output layer, `self.output(...)`.

The constituent layers of each MLP **must be instance-level variables**. After all, if we instantiated two such models `net1` and `net2` and trained them on different data, we would expect them to represent two different learned models.
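
To see why instance-level storage matters, here is a small framework-free sketch. The classes `SharedLayers` and `OwnLayers` are hypothetical and stand in for models whose "layers" are plain lists; the point is only the difference between class-level and instance-level attributes in Python.

```python
class SharedLayers:
    # Class-level attribute: one list shared by every instance
    weights = [0.0, 0.0]


class OwnLayers:
    def __init__(self):
        # Instance-level attribute: a fresh list for each instance
        self.weights = [0.0, 0.0]


a, b = SharedLayers(), SharedLayers()
a.weights[0] = 1.0            # "training" a silently changes b as well
shared_leak = b.weights[0]

c, d = OwnLayers(), OwnLayers()
c.weights[0] = 1.0            # d keeps its own, untouched weights
own_leak = d.weights[0]

print(shared_leak, own_leak)  # 1.0 0.0
```

With class-level storage, `net1` and `net2` would end up as the same learned model, which is exactly what we want to avoid.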


The `__init__` method is the most natural place to instantiate the layers that we subsequently invoke on each call to the `forward` method. Note that before getting on with the interesting parts, our customized `__init__` method must invoke the parent class's `__init__` method, `super(MLP, self).__init__(**kwargs)`, to save us from reimplementing boilerplate code applicable to most Blocks. Then all that is left is to instantiate our two Dense layers, assigning them to `self.hidden` and `self.output`, respectively. Again note that when dealing with standard functionality like this, we do not have to worry about backpropagation, since the `backward` method is generated for us automatically. The same goes for the `initialize` method. Let's try it out:

In [8]:
net = MLP()
net.initialize()
net(x)

array([[-0.03989593, -0.10414708,  0.06799038,  0.05245075,  0.02526059,
        -0.00640342,  0.04182098, -0.01665319, -0.02067345, -0.07863817],
       [-0.03612846, -0.07210436,  0.0915948 ,  0.0789077 ,  0.02494172,
        -0.01028664,  0.01732428, -0.02843242,  0.03772651, -0.06671704]])

As argued earlier, the primary virtue of the Block abstraction is its versatility. We can subclass `Block` to create layers (such as the Dense class provided by Gluon), entire models (such as the MLP class implemented above), or various components of intermediate complexity, a pattern that we will lean on heavily throughout the next chapters on CNNs.

#### 5.1.2 The Sequential Block

As we described earlier, the Sequential class is also just a subclass of Block, **designed specifically for daisy-chaining other Blocks together**. All we need to do to implement our own MySequential block is to define a few convenience functions: 
- 1) An `add` method for appending Blocks one by one to a list.
- 2) A `forward` method to pass inputs through the chain of Blocks (in the order of addition)

The following MySequential class delivers the same functionality as Gluon's default Sequential class:

In [10]:
class MySequential(nn.Block):
    def __init__(self, **kwargs):
        super(MySequential, self).__init__(**kwargs)
        
    def add(self, block):
        # Here, block is an instance of a Block subclass, and we assume it has
        # a unique name. We save it in the member variable _children of the Block
        # class, whose type is OrderedDict. When the MySequential instance calls
        # the initialize function, the system automatically initializes all
        # members of _children
        self._children[block.name] = block
        
    def forward(self, x):
        # OrderedDict guarantees that members will be traversed in the order
        # they were added.
        for block in self._children.values():
            x = block(x)
        return x

At its core is the `add` method. It adds any block to the ordered dictionary of children. These are then executed in sequence when forward propagation is invoked. Let's see what the MLP looks like now:

In [28]:
net = MySequential()
net.add(nn.Dense(256,activation = 'relu'))
net.add(nn.Dense(10))
net.initialize()
net(x)

array([[-0.0764568 , -0.01130233,  0.04952145, -0.04651388, -0.04131571,
        -0.05884131, -0.06213811,  0.01311471, -0.01379425, -0.02514282],
       [-0.05124623,  0.00711232, -0.00155934, -0.07555378, -0.06675333,
        -0.01762913,  0.00589084,  0.01447191, -0.04330775,  0.03317727]])
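
Stripped of Gluon specifics, the chaining logic amounts to an ordered container of callables. The sketch below is a pure-Python illustration under that assumption; `TinySequential` and the named lambdas are hypothetical and are not part of Gluon. It also shows that the order of `add` calls determines the result.

```python
from collections import OrderedDict


class TinySequential:
    """Hypothetical stand-in for a sequential container of callables."""

    def __init__(self):
        self._children = OrderedDict()

    def add(self, name, fn):
        # Insertion order is preserved, so it doubles as execution order
        self._children[name] = fn

    def forward(self, x):
        for fn in self._children.values():
            x = fn(x)
        return x


double_then_inc = TinySequential()
double_then_inc.add('double', lambda v: 2 * v)
double_then_inc.add('increment', lambda v: v + 1)

inc_then_double = TinySequential()
inc_then_double.add('increment', lambda v: v + 1)
inc_then_double.add('double', lambda v: 2 * v)

first = double_then_inc.forward(3)   # (3 * 2) + 1
second = inc_then_double.forward(3)  # (3 + 1) * 2
print(first, second)                 # 7 8
```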

#### 5.1.3 Blocks with Code

Although the Sequential class can make model construction easier, since you do not need to define the `forward` method, directly inheriting from the Block class can greatly expand the flexibility of model construction. In particular, we will use Python's control flow within the `forward` method. While we are at it, we need to introduce another concept, that of the *constant* parameter. **These are parameters that are not updated when invoking backpropagation**. This sounds very abstract, but here's what is really going on. Assume that we have some function:

- $f(\textbf{x}, \textbf{w}) = 3 * \textbf{w}^T\textbf{x}$

In this case 3 is a constant parameter. We could change 3 to something else, say c, via

- $f(\textbf{x}, \textbf{w}) = c * \textbf{w}^T\textbf{x}$

Nothing has really changed, except that we can adjust the value of *c*. It is still a constant as far as $\textbf{w}$ and $\textbf{x}$ are concerned. However, since Gluon does not know about this beforehand, it is worthwhile to give it a hand (this makes the code go faster, since we are not sending the Gluon engine on a wild goose chase after a parameter that does not change). `get_constant` is the method that can be used to accomplish this. Let's see what this looks like in practice.
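
Before turning to Gluon, it may help to check the claim numerically. The following is a plain-Python sketch (the helper names `f` and `grad_w` are made up for this illustration): the gradient with respect to $\textbf{w}$ is $c\,\textbf{x}$, and no gradient is ever needed for $c$ itself.

```python
def f(x, w, c=3.0):
    # f(x, w) = c * w^T x, with c treated as a fixed constant
    return c * sum(wi * xi for wi, xi in zip(w, x))


def grad_w(x, c=3.0):
    # Analytic gradient with respect to w is c * x; c receives no gradient
    return [c * xi for xi in x]


x = [1.0, 2.0]
w = [0.5, -1.0]
value = f(x, w)     # 3 * (0.5 * 1 + (-1) * 2) = -4.5
grad = grad_w(x)    # [3.0, 6.0]

# Finite-difference check of the analytic gradient
eps = 1e-6
numeric = []
for i in range(len(w)):
    w_plus = list(w)
    w_plus[i] += eps
    numeric.append((f(x, w_plus) - f(x, w)) / eps)

print(value, grad)
```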

In [29]:
class FancyMLP(nn.Block):
    def __init__(self, **kwargs):
        super(FancyMLP, self).__init__(**kwargs)
        # Random weight parameters created with get_constant are not
        # updated during training
        self.rand_weight = self.params.get_constant(
            'rand_weight', np.random.uniform(size = (20,20)))
        self.dense = nn.Dense(20, activation='relu')
        
    def forward(self, x):
        x = self.dense(x)
        # Use the constant parameters created, as well as the relu
        # and dot functions
        x = npx.relu(np.dot(x, self.rand_weight.data()) + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        x = self.dense(x)
        # Data-dependent control flow: halve x until the sum of its absolute
        # values is at most 1, then rescale it if that sum fell below 0.8
        while np.abs(x).sum() > 1:
            x /= 2
        if np.abs(x).sum() < 0.8:
            x *= 10
        return x.sum()

In this `FancyMLP` model, we used the constant weight `rand_weight` (**note that it is not a model parameter**), performed a matrix multiplication (`np.dot`), and reused the same Dense layer. Note that this is very different from using two dense layers with different sets of parameters. Instead, we used the same layer twice. **Quite often in deep networks one also says that the parameters are *tied* when one wants to express that multiple parts of a network share the same parameters**. Let's see what happens if we construct our FancyMLP and feed data through it.

In [30]:
net = FancyMLP()
net.initialize()
net(x)

array(5.2637568)

There is no reason why we couldn't mix and match these ways of building a network. Admittedly, the next example resembles a chimera or, less charitably, a Rube Goldberg machine. That said, it shows how to build a block from individual blocks, which in turn may be built from blocks themselves. Furthermore, we can even combine multiple strategies inside the same `forward` function. To demonstrate this, here's the network.

In [31]:
class NestMLP(nn.Block):
    def __init__(self, **kwargs):
        super(NestMLP, self).__init__(**kwargs)
        self.net = nn.Sequential()
        self.net.add(nn.Dense(64, activation = 'relu'),
                     nn.Dense(32, activation = 'relu'))
        self.dense = nn.Dense(16, activation = 'relu')
        
    def forward(self, x):
        return self.dense(self.net(x))

In [32]:
chimera = nn.Sequential()
chimera.add(NestMLP(), nn.Dense(20), FancyMLP())

In [33]:
chimera.initialize()
chimera(x)

array(0.9772054)

#### Summary:

- 1) Layers are blocks
- 2) Many layers can be a block
- 3) Many blocks can be a block
- 4) Code can be a block
- 5) Blocks take care of a lot of housekeeping, such as parameter initialization, backprop and related issues
- 6) Sequential concatenations of layers and blocks are handled by the eponymous Sequential block.

### 5.2 Parameter Management

The ultimate goal of training deep networks is to find good parameter values for a given architecture. When everything is standard, the nn.Sequential class is a perfectly good tool for it. However, very few models are entirely standard and most scientists want to build things that are novel. This section shows how to manipulate parameters. In particular, the following aspects will be covered:

- Accessing parameters for debugging, diagnostics, to visualize them or to save them is the first step to understanding how to work with custom models.
- Second, we want to set them in specific ways, e.g., for initialization purposes. We discuss the structure of parameter initializers.
- Last, we show how this knowledge can be put to good use by building networks that share some parameters.


Let's start with our trusty Multilayer Perceptron with a hidden layer. This will serve as our choice for demonstrating the features discussed above.

In [1]:
from mxnet import init, np, npx
from mxnet.gluon import nn

npx.set_np()

In [2]:
net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize() # using the default initialization method

In [3]:
x = np.random.uniform(size = (2,20))
net(x) # forward pass

array([[ 0.06240272, -0.03268594,  0.02582652,  0.02254181, -0.03728798,
        -0.04253786,  0.00540613, -0.01364184, -0.09915452, -0.02272737],
       [ 0.02816678, -0.03341203,  0.03565667,  0.02506383, -0.04136416,
        -0.04941843,  0.01738529,  0.01081963, -0.09932576, -0.01176296]])

#### 5.2.1 Parameter Access

In the case of a Sequential class we can access the parameters with ease, simply by indexing each of the layers in the network. The params variable then contains the required data. Let's try this out in practice by inspecting the parameters of the first layer.

In [4]:
print(net[0].params)
print(net[1].params)

dense0_ (
  Parameter dense0_weight (shape=(256, 20), dtype=float32)
  Parameter dense0_bias (shape=(256,), dtype=float32)
)
dense1_ (
  Parameter dense1_weight (shape=(10, 256), dtype=float32)
  Parameter dense1_bias (shape=(10,), dtype=float32)
)


**Note**: I couldn't picture the shapes of the weights shown above at first, so I took a deeper dive into the forward pass. Why does `dense0_weight` have shape (256, 20)? Because we specified 256 hidden units in `nn.Dense`, and 20 is the number of input features. In other words, each of the 256 nodes receives the 20 features passed from the input layer. Next, why does `dense1_weight` have shape (10, 256)? Because we asked for 10 outputs, and 256 is the number of hidden units in the previous layer that feeds into it.

The output tells us a number of things. First, the layer consists of two sets of parameters: `dense0_weight` and `dense0_bias`, as we would expect. They are both single precision and they have the necessary shapes that we would expect from the first layer, given that the input dimension is 20 and the output dimension is 256. In particular the names of the parameters are very useful since they allow us to identify parameters *uniquely* even in a network of hundreds of layers and with nontrivial structure. The second layer is structured accordingly.
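
The shape bookkeeping can be reproduced with a few lines of plain Python. This is only a sketch of the (output units, input units) layout seen in the printout above; the helpers `dense_param_shapes` and `num_params` are hypothetical names, not Gluon functions.

```python
def dense_param_shapes(in_units, units):
    """Shapes of a fully-connected layer's parameters, weight as (out, in)."""
    return {'weight': (units, in_units), 'bias': (units,)}


def num_params(shapes):
    # Multiply out each shape and add up the element counts
    total = 0
    for shape in shapes.values():
        count = 1
        for dim in shape:
            count *= dim
        total += count
    return total


layer0 = dense_param_shapes(in_units=20, units=256)
layer1 = dense_param_shapes(in_units=256, units=10)
total = num_params(layer0) + num_params(layer1)

print(layer0)  # {'weight': (256, 20), 'bias': (256,)}
print(layer1)  # {'weight': (10, 256), 'bias': (10,)}
print(total)   # 7946
```

The input size of each layer is simply the output size of the layer before it, which is why only the first layer depends on the 20 input features.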

#### Targeted Parameters

In order to do something useful with the parameters we need to access them, though. There are several ways to do this, ranging from simple to general. Let's look at some of them.

In [5]:
print(net[1].bias)
print(net[1].bias.data())

Parameter dense1_bias (shape=(10,), dtype=float32)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


The first statement returns the bias of the second layer. Since this is an object containing data, gradients, and additional information, **we need to request the data explicitly**. Note that the bias is all 0 since we initialized the bias to contain all zeros. Note that we can also access the parameters by name, such as `dense0_weight`. This is possible since each layer comes with its own parameter dictionary that can be accessed directly. Both methods are entirely equivalent, but the first method leads to much more readable code.

In [6]:
print(net[0].params['dense0_weight'])
print(net[0].params['dense0_weight'].data())

Parameter dense0_weight (shape=(256, 20), dtype=float32)
[[ 0.06700657 -0.00369488  0.0418822  ... -0.05517294 -0.01194733
  -0.00369594]
 [-0.03296221 -0.04391347  0.03839272 ...  0.05636378  0.02545484
  -0.007007  ]
 [-0.0196689   0.01582889 -0.00881553 ...  0.01509629 -0.01908049
  -0.02449339]
 ...
 [-0.02055008 -0.02618652  0.06762936 ... -0.02315108 -0.06794678
  -0.04618235]
 [ 0.02802853  0.06672969  0.05018687 ... -0.02206502 -0.01315478
  -0.03791244]
 [-0.00638592  0.00914261  0.06667828 ... -0.00800052  0.03406764
  -0.03954004]]


Note that the weights are nonzero. This is by design, since they were initialized randomly. `data` is not the only function that we can invoke. For instance, we can access the gradient with respect to the parameters. It has the same shape as the weight. However, since we have not invoked backpropagation yet, the values are all 0.

In [7]:
net[0].weight.grad()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

#### All Parameters at Once

Accessing parameters as described above can be a bit tedious, in particular if we have more complex blocks, or blocks of blocks (or even blocks of blocks of blocks), since we need to walk through the entire tree in reverse order to how the blocks were constructed. To avoid this, blocks come with a method `collect_params()` which grabs all parameters of a network in one dictionary such that we can traverse it with ease. It does so by iterating over all constituents of a block and calling `collect_params()` on subblocks as needed. To see the difference consider the following:

In [8]:
# parameters only for the first layer
print(net[0].collect_params())
# parameters of the entire network
print(net.collect_params())

dense0_ (
  Parameter dense0_weight (shape=(256, 20), dtype=float32)
  Parameter dense0_bias (shape=(256,), dtype=float32)
)
sequential0_ (
  Parameter dense0_weight (shape=(256, 20), dtype=float32)
  Parameter dense0_bias (shape=(256,), dtype=float32)
  Parameter dense1_weight (shape=(10, 256), dtype=float32)
  Parameter dense1_bias (shape=(10,), dtype=float32)
)


This provides us with a third way of accessing the parameters of the network. If we wanted to get the value of the bias term of the second layer we could simply use this:

In [9]:
net.collect_params()['dense1_bias'].data()

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
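
The recursive collection that `collect_params()` performs can be pictured with a small tree walk. This is a pure-Python sketch under simplifying assumptions, not MXNet's implementation: the `Node` class is hypothetical, and parameters are represented only by their shapes.

```python
class Node:
    """Hypothetical stand-in for a block: own parameters plus child blocks."""

    def __init__(self, params=None, children=None):
        self.params = dict(params or {})
        self.children = list(children or [])

    def collect_params(self):
        # Start with this block's own parameters, then recurse into children
        collected = dict(self.params)
        for child in self.children:
            collected.update(child.collect_params())
        return collected


dense0 = Node(params={'dense0_weight': (256, 20), 'dense0_bias': (256,)})
dense1 = Node(params={'dense1_weight': (10, 256), 'dense1_bias': (10,)})
net_sketch = Node(children=[dense0, dense1])

all_params = net_sketch.collect_params()
print(list(all_params))
# ['dense0_weight', 'dense0_bias', 'dense1_weight', 'dense1_bias']
```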

We are going to see how various blocks name their subblocks throughout (Sequential simply numbers them). This makes it very convenient to use regular expressions to filter out the required parameters.

In [10]:
print(net.collect_params('.*weight'))
print(net.collect_params('dense0.*'))

sequential0_ (
  Parameter dense0_weight (shape=(256, 20), dtype=float32)
  Parameter dense1_weight (shape=(10, 256), dtype=float32)
)
sequential0_ (
  Parameter dense0_weight (shape=(256, 20), dtype=float32)
  Parameter dense0_bias (shape=(256,), dtype=float32)
)
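
The same filtering can be mimicked with Python's `re` module on a plain list of names. The sketch assumes that a pattern is matched from the start of each name (`re.match` semantics); the helper `select` is a hypothetical name used only here.

```python
import re

names = ['dense0_weight', 'dense0_bias', 'dense1_weight', 'dense1_bias']


def select(names, pattern):
    # Keep the names that the pattern matches from the beginning
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]


weights_only = select(names, '.*weight')
first_layer = select(names, 'dense0.*')

print(weights_only)  # ['dense0_weight', 'dense1_weight']
print(first_layer)   # ['dense0_weight', 'dense0_bias']
```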


#### Rube Goldberg Strikes Again

Let's see how the parameter naming conventions work if we nest multiple blocks inside each other. For that we first define a function that produces blocks (a block factory, so to speak) and then we combine these inside yet larger blocks.

In [11]:
def block1():
    net = nn.Sequential()
    net.add(nn.Dense(32, activation='relu'))
    net.add(nn.Dense(16, activation='relu'))
    return net


def block2():
    net = nn.Sequential()
    for i in range(4):
        net.add(block1())
    return net

In [12]:
rgnet = nn.Sequential()
rgnet.add(block2())
rgnet.add(nn.Dense(10))
rgnet.initialize()
rgnet(x)

array([[-4.1923052e-09,  1.9830513e-09,  8.9443941e-10,  6.2912981e-09,
        -3.3241803e-09,  5.4330047e-09,  1.6013488e-09, -3.7408676e-09,
         8.5468486e-09, -6.4805539e-09],
       [-3.7507055e-09,  1.4866971e-09,  6.8314709e-10,  5.6925775e-09,
        -2.6349158e-09,  4.8626658e-09,  1.4280280e-09, -3.4603027e-09,
         7.4127913e-09, -5.7896128e-09]])

In [13]:
print(rgnet.collect_params)
print(rgnet.collect_params())

<bound method Block.collect_params of Sequential(
  (0): Sequential(
    (0): Sequential(
      (0): Dense(20 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
    (1): Sequential(
      (0): Dense(16 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
    (2): Sequential(
      (0): Dense(16 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
    (3): Sequential(
      (0): Dense(16 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
  )
  (1): Dense(16 -> 10, linear)
)>
sequential1_ (
  Parameter dense2_weight (shape=(32, 20), dtype=float32)
  Parameter dense2_bias (shape=(32,), dtype=float32)
  Parameter dense3_weight (shape=(16, 32), dtype=float32)
  Parameter dense3_bias (shape=(16,), dtype=float32)
  Parameter dense4_weight (shape=(32, 16), dtype=float32)
  Parameter dense4_bias (shape=(32,), dtype=float32)
  Parameter dense5_weight (shape=(16, 32), dtype=float32)
  Parameter dense5_

Since the layers are hierarchically generated, we can also access them accordingly. For instance, to access the first major block, within it the second subblock and then within it, in turn the bias of the first layer, we perform the following:

In [14]:
rgnet[0][1][0].bias.data()

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

#### 5.2.2 Parameter Initialization

Now that we know how to access the parameters, let's look at how to initialize them properly. We discussed the need for initialization in previous sections. By default, MXNet initializes the weight matrices **uniformly by drawing from** $U[-0.07, 0.07]$ and the bias parameters are all set to 0. However, we often need to use other methods to initialize the weights. MXNet's `init` module provides a variety of preset initialization methods, but if we want something out of the ordinary, we need a bit of extra work.

#### Built-in Initialization

Let's begin with the built-in initializers. The code below initializes all parameters with Gaussian random variables.

In [45]:
# force_reinit ensures that the variables are initialized again, regardless of
# whether they were already initialized previously
net.initialize(init = init.Normal(sigma = 0.01), force_reinit=True)

In [46]:
net[0].weight.data()[0]

array([ 0.00621872,  0.00547382, -0.00680312, -0.00243756,  0.00674617,
       -0.01452631,  0.00505583, -0.00882651, -0.00724038,  0.0057432 ,
       -0.0075844 ,  0.00208307, -0.00436246,  0.00037523,  0.00605163,
       -0.01103913, -0.00349048, -0.01203271, -0.00587942, -0.0059022 ])

If we wanted to initialize all parameters to 1, we could do this simply by changing the initializer to `Constant(1)`.

In [47]:
net.initialize(init = init.Constant(1), force_reinit = True)
net[0].weight.data()[0]

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1.])

If we wanted to initialize only a specific parameter in a different manner, we can set the initializer for just the appropriate subblock (or parameter). For instance, below we initialize the second layer to a constant value of 42 and we use the `Xavier` initializer for the weights of the first layer.

In [62]:
net[1].collect_params()['dense1_weight'].data()

array([[42., 42., 42., ..., 42., 42., 42.],
       [42., 42., 42., ..., 42., 42., 42.],
       [42., 42., 42., ..., 42., 42., 42.],
       ...,
       [42., 42., 42., ..., 42., 42., 42.],
       [42., 42., 42., ..., 42., 42., 42.],
       [42., 42., 42., ..., 42., 42., 42.]])

In [63]:
net[1].initialize(init = init.Constant(42), force_reinit=True)
net[0].initialize(init = init.Xavier(), force_reinit=True)

print(net[1].weight.data()[0,0])
print(net[0].weight.data()[0])

42.0
[ 0.0325658  -0.02506374 -0.05946625 -0.02983668 -0.05573101 -0.14466211
  0.08818261  0.05468664  0.04162778  0.10555196 -0.00365587 -0.03599197
  0.11360577  0.11628401 -0.05024661  0.09044486 -0.11333492 -0.14139469
 -0.09055966 -0.01037596]


#### Custom Initialization

Sometimes, the initialization methods that we need are not provided in the `init` module. At this point, we can implement a subclass of the `Initializer` class so that we can use it like any other initialization method. Usually, we only need to implement the `_init_weight` function and fill the incoming ndarray with the desired initial values. In the example below, we pick a decidedly bizarre and nontrivial distribution, just to prove the point. We draw the coefficients from the following distribution:

$w \sim U[5, 10]$ with probability $\frac{1}{4}$

$w = 0$ with probability $\frac{1}{2}$

$w \sim U[-10, -5]$ with probability $\frac{1}{4}$

In [64]:
class MyInit(init.Initializer):
    def _init_weight(self, name, data):
        print('Init', name, data.shape)
        data[:] = np.random.uniform(-10,10, data.shape)
        data *= np.abs(data) >= 5

In [65]:
net.initialize(MyInit(), force_reinit=True)
net[0].weight.data()[0]

Init dense0_weight (256, 20)
Init dense1_weight (10, 256)


array([-0.       ,  0.       ,  0.       ,  0.       ,  0.       ,
        8.065113 , -0.       ,  5.2081738,  7.5088882,  9.696129 ,
        5.7685013, -0.       , -7.300396 , -5.4582067,  9.357521 ,
        0.       , -5.085011 ,  0.       ,  7.296194 ,  6.3765545])
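
It may not be obvious that drawing from $U[-10, 10]$ and zeroing every entry with absolute value below 5 yields the stated mixture. The following standard-library simulation is a sketch of that argument (the helper `sample_weight` is a hypothetical name): half of the interval $[-10, 10]$ lies within $(-5, 5)$, and the remaining half splits evenly between the positive and negative ranges.

```python
import random


def sample_weight(rng):
    # Draw from U[-10, 10] and zero out values with magnitude below 5
    w = rng.uniform(-10, 10)
    return w if abs(w) >= 5 else 0.0


rng = random.Random(0)
samples = [sample_weight(rng) for _ in range(100_000)]

frac_zero = sum(1 for s in samples if s == 0.0) / len(samples)
frac_pos = sum(1 for s in samples if s >= 5) / len(samples)
frac_neg = sum(1 for s in samples if s <= -5) / len(samples)

# The three fractions should be close to 1/2, 1/4 and 1/4 respectively
print(frac_zero, frac_pos, frac_neg)
```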

If even this functionality is insufficient, we can set parameters directly. Since `data()` returns an ndarray we can access it just like any other matrix. A note for advanced users: if you want to adjust parameters within an autograd scope you need to use `set_data` to avoid confusing the automatic differentiation mechanics.

In [74]:
net[0].weight.data()[:] += 1
net[0].weight.data()[0,0] = 42
net[0].weight.data()[0]

array([42.       ,  1.       ,  1.       ,  1.       ,  1.       ,
        9.065113 ,  1.       ,  6.2081738,  8.508888 , 10.696129 ,
        6.7685013,  1.       , -6.300396 , -4.4582067, 10.357521 ,
        1.       , -4.085011 ,  1.       ,  8.296194 ,  7.3765545])

#### 5.2.3 Tied Parameters

In some cases, we want to share model parameters across multiple layers. For instance, when we want to find good word embeddings we may decide to use the same parameters both for encoding and decoding of words. Let's see how we can do this in MXNet. In the following, we allocate a dense layer and then use its parameters specifically to set those of another layer.

In [78]:
net = nn.Sequential()
# We need to give the shared layer a name such that we can reference its
# parameters
shared = nn.Dense(8, activation='relu') # Dense layer allocation

net.add(nn.Dense(8, activation='relu'),
        shared,
        nn.Dense(8, activation='relu', params = shared.params),
        nn.Dense(10))
net.initialize()

In [79]:
x = np.random.uniform(size = (2,20))
net(x)

array([[ 3.20007457e-05,  7.32462649e-06, -1.06070962e-04,
         7.88085435e-06, -3.94340168e-06,  6.45207547e-06,
        -1.30222645e-04,  2.85319184e-05, -1.68804618e-05,
         2.79861924e-05],
       [ 2.37204458e-05, -4.62006210e-05, -1.14139155e-04,
        -1.38660471e-06,  1.59751289e-05,  5.39458124e-05,
        -8.65445909e-05,  5.53990903e-05, -3.30448893e-05,
         4.20731558e-05]])

In [80]:
# Checking whether the parameters are the same
print(net[1].weight.data()[0] == net[2].weight.data()[0])
net[1].weight.data()[0, 0] = 100
# Make sure that they are actually the same object rather than just having the same 
# value
print(net[1].weight.data()[0] == net[2].weight.data()[0])

[ True  True  True  True  True  True  True  True]
[ True  True  True  True  True  True  True  True]


The above example shows that the parameters of the second and third layers are tied. They are identical rather than just being equal. That is, by changing one of the parameters the other changes, too. What happens to the gradients is quite ingenious. Since the model parameters contain gradients, the gradients of the second hidden layer and the third hidden layer **are accumulated in the gradient of the shared parameters (for example, `shared.weight.grad()`) during backpropagation**.
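
The accumulation can be verified by hand on a scalar example. The sketch below is plain Python with a hypothetical helper `tied_forward_backward`; it uses one weight $w$ twice, $y = w \cdot (w \cdot x)$, so the total gradient is the sum of the contributions from both uses, $2wx$.

```python
def tied_forward_backward(w, x):
    h = w * x                # first use of the shared weight
    y = w * h                # second use of the same weight
    grad_second = h          # dy/dw through the second use, holding h fixed
    grad_first = w * x       # (dy/dh) * (dh/dw) through the first use
    # The shared weight receives the sum of both contributions
    return y, grad_first + grad_second


y, grad = tied_forward_backward(w=3.0, x=2.0)
print(y, grad)  # 18.0 12.0
```

Here each use contributes $wx = 6$, and the accumulated gradient $12$ equals $2wx$, the derivative of $w^2 x$.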

### 5.3 Deferred Initialization

In the previous examples we played fast and loose with setting up our networks. In particular, we did the following things that *shouldn't* work:

- We defined the network architecture with no regard to the input dimensionality
- We added layers without regard to the output dimensionality of the previous layers
- We even "initialized" these parameters without knowing how many parameters there were to initialize.

All of those things sound impossible and, indeed, they are. After all, there is no way MXNet (or any other framework for that matter) could predict what the input dimensionality of a network would be. Later on, when working with convolutional neural networks and images this problem will become even more pertinent, **since the input dimensionality (i.e. the resolution of an image) will affect the dimensionality of subsequent layers at a long range**. Hence, the ability to set parameters without the need to know at the time of writing the code what the dimensionality is can greatly simplify statistical modeling. In what follows, we will discuss how this works using initialization as an example. After all, we cannot initialize variables that we do not know exist.

#### 5.3.1 Instantiating a Network

In [2]:
from mxnet import init, np, npx
from mxnet.gluon import nn
npx.set_np()

In [3]:
def getnet():
    net = nn.Sequential()
    net.add(nn.Dense(256, activation='relu'))
    net.add(nn.Dense(10))
    return net

In [4]:
net = getnet()

At this point the network does not yet know what the dimensionality of the various parameters should be. All one can tell is that each layer needs a weight and a bias, albeit of **unspecified** dimensionality. If we try accessing the parameters, that is exactly what we see: the unknown dimensions are reported as `-1`.

In [5]:
print(net.collect_params)
print(net.collect_params())

<bound method Block.collect_params of Sequential(
  (0): Dense(-1 -> 256, Activation(relu))
  (1): Dense(-1 -> 10, linear)
)>
sequential0_ (
  Parameter dense0_weight (shape=(256, -1), dtype=float32)
  Parameter dense0_bias (shape=(256,), dtype=float32)
  Parameter dense1_weight (shape=(10, -1), dtype=float32)
  Parameter dense1_bias (shape=(10,), dtype=float32)
)


In particular, trying to access `net[0].weight.data()` at this point would trigger a runtime error stating that the network needs initializing **before** it can do anything. Let's see whether anything changes after we initialize the parameters.

In [6]:
net.initialize()
net.collect_params()

sequential0_ (
  Parameter dense0_weight (shape=(256, -1), dtype=float32)
  Parameter dense0_bias (shape=(256,), dtype=float32)
  Parameter dense1_weight (shape=(10, -1), dtype=float32)
  Parameter dense1_bias (shape=(10,), dtype=float32)
)

As we can see, nothing really changed. Only once we provide the network with some data do we see a difference. Let's try it out.

In [7]:
x = np.random.uniform(size = (2,20))
net(x) # Forward Computation
net.collect_params()

sequential0_ (
  Parameter dense0_weight (shape=(256, 20), dtype=float32)
  Parameter dense0_bias (shape=(256,), dtype=float32)
  Parameter dense1_weight (shape=(10, 256), dtype=float32)
  Parameter dense1_bias (shape=(10,), dtype=float32)
)

The main difference from before is that as soon as we knew the input dimensionality, $\mathbf{x} \in \mathbb{R}^{20}$, it was possible to define the weight matrix for the first layer, i.e., $\mathbf{W}_1 \in \mathbb{R}^{256 \times 20}$. With that out of the way, we can progress to the second layer, define its dimensionality to be $10 \times 256$, and so on through the computational graph, binding all the dimensions as they become available. Once the shapes are known, we can proceed by initializing the parameters. This is the solution to the three problems outlined above.
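
The shape propagation described above can be mimicked in a few lines of plain NumPy (not `mxnet.np`). This is only a sketch of the idea, not how Gluon implements it: `LazyDense` is a hypothetical class that postpones allocating its weight until it sees its first input, and the activation is omitted because only the shapes matter here.

```python
import numpy as np  # plain NumPy, not mxnet.np

class LazyDense:
    """Hypothetical dense layer that allocates its weight on first use."""
    def __init__(self, units):
        self.units = units
        self.weight = None               # input size is unknown so far
        self.bias = np.zeros(units)      # bias shape depends on units only

    def __call__(self, x):
        if self.weight is None:
            in_units = x.shape[-1]       # inferred from the incoming data
            rng = np.random.default_rng(0)
            self.weight = 0.01 * rng.normal(size=(self.units, in_units))
        return x @ self.weight.T + self.bias

layers = [LazyDense(256), LazyDense(10)]
shapes_before = [None if l.weight is None else l.weight.shape for l in layers]

x = np.random.default_rng(1).uniform(size=(2, 20))
out = x
for layer in layers:
    out = layer(out)                     # each call binds one more shape

shapes_after = [l.weight.shape for l in layers]
print(shapes_before)
print(shapes_after)
print(out.shape)
```

Before the forward pass both weights are unallocated; afterwards their shapes are `(256, 20)` and `(10, 256)`, matching the parameter listing above, and the output has shape `(2, 10)`.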

#### 5.3.2 Deferred Initialization in Practice

Now that we know how it works in theory, let's see when the initialization is actually triggered. In order to do so, we mock up an initializer which does nothing but report a debug message stating when it was invoked and with which parameters.

In [8]:
class MyInit(init.Initializer):
    def _init_weight(self, name, data):
        print('Init', name, data.shape)
        # Omitting the actual initialization logic here

        
net = getnet()
net.initialize(init = MyInit())

Note that although `MyInit` prints information about the model parameters whenever it is called, the `initialize` call above prints nothing. In other words, no parameters are actually initialized when `initialize` is called. Next, we define the input and perform a forward calculation.

In [9]:
x = np.random.uniform(size = (2,20))
y = net(x)

Init dense2_weight (256, 20)
Init dense3_weight (10, 256)


At this time, information on the model parameters is printed. When performing a forward calculation based on the input x, the system can automatically infer the shape of the weight parameters of all layers based on the shape of the input. Once the system has created these parameters, it calls the MyInit instance to initialize them **before proceeding with the forward calculation**.

Of course, **this initialization will only be called when completing the initial forward calculation**. After that, we will not re-initialize when we run the forward calculation `net(x)`, so the output of the MyInit instance will not be generated again.

In [10]:
y = net(x)

As mentioned at the beginning of this section, deferred initialization can also cause confusion. Before the first forward calculation, we are unable to manipulate the model parameters directly; for example, we cannot use the `data` and `set_data` functions to get and modify the parameters. Therefore, we often force initialization by sending a sample observation through the network.

#### 5.3.3 Forced Initialization

Deferred initialization does not occur if the system knows the shape of all parameters when calling the `initialize` function. This can occur in two cases:
- We have already seen some data and we just want to reset the parameters
- We specified all input and output dimensions of the network when defining it

The first case works just fine, as illustrated in the example below.

In [11]:
net.initialize(init=MyInit(), force_reinit=True)

Init dense2_weight (256, 20)
Init dense3_weight (10, 256)


The second case requires us to specify the remaining set of parameters when creating the layer. For instance, for dense layers we also need to specify `in_units` so that initialization can occur immediately once `initialize` is called.

In [12]:
net = nn.Sequential()
net.add(nn.Dense(256, in_units=20, activation='relu'))
net.add(nn.Dense(10, in_units=256))
net.initialize(init = MyInit())

Init dense4_weight (256, 20)
Init dense5_weight (10, 256)
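
The timing of the initializer calls in the cases above can be reproduced with a small plain-NumPy sketch (not `mxnet.np`). `DeferredDense` and `logging_init` are hypothetical names, and the control flow is an assumption about the general idea rather than a description of Gluon's internals.

```python
import numpy as np  # plain NumPy, not mxnet.np

class DeferredDense:
    """Hypothetical layer; in_units=None means "infer from the first input"."""
    def __init__(self, units, in_units=None):
        self.units, self.in_units = units, in_units
        self.weight = None
        self.init_fn = None

    def initialize(self, init_fn, force_reinit=False):
        self.init_fn = init_fn
        if self.weight is not None and not force_reinit:
            return                        # already initialized; nothing to do
        if self.in_units is not None:     # shape known: initialize right away
            self.weight = init_fn((self.units, self.in_units))

    def __call__(self, x):
        if self.weight is None:           # shape unknown until now: deferred
            self.in_units = x.shape[-1]
            self.weight = self.init_fn((self.units, self.in_units))
        return x @ self.weight.T

calls = []
def logging_init(shape):
    calls.append(shape)                   # record when and with which shape
    return np.zeros(shape)

layer = DeferredDense(256)
layer.initialize(logging_init)
calls_after_initialize = list(calls)      # nothing could be allocated yet
layer(np.ones((2, 20)))
calls_after_forward = list(calls)         # the first forward pass triggers it
layer.initialize(logging_init, force_reinit=True)
calls_after_reinit = list(calls)          # the shape is known, so this runs now

eager = DeferredDense(10, in_units=256)
eager.initialize(logging_init)            # all shapes given: runs immediately
print(calls)
```

The log stays empty after the first `initialize`, gains `(256, 20)` on the first forward pass, gains it again on the forced re-initialization, and gains `(10, 256)` as soon as the fully specified layer is initialized.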


#### Recap

- Deferred initialization is a good thing. It allows Gluon to set many things automatically and it removes a great source of errors from defining novel network architectures.
- We can override this by specifying all implicitly defined variables (for example, `in_units` for dense layers).
- Initialization can be repeated (or forced) by setting the `force_reinit = True` flag.

### 5.4 Custom Layers

One of the reasons for the success of deep learning can be found in the wide range of layers that can be used in a deep network. This allows for a tremendous degree of customization and adaptation. For instance, scientists have invented layers for images, text, pooling, loops, dynamic programming, and even for computer programs. Sooner or later you will encounter a layer that does not exist yet in Gluon, or even better, you will eventually invent a new layer that works well for your problem at hand. This is the time to build a custom layer. This section shows how.

#### 5.4.1 Layers without Parameters

Since this is slightly intricate, we start with a custom layer (also known as Block) that does not have any inherent parameters. Our first step is very similar to when we introduced blocks in previous sections. The following `CenteredLayer` class constructs a layer that subtracts the mean from the input. We build it by inheriting from the Block class and implementing a forward method.

In [13]:
from mxnet import gluon, np, npx
from mxnet.gluon import nn

npx.set_np()

In [14]:
class CenteredLayer(nn.Block):
    def __init__(self, **kwargs):
        super(CenteredLayer, self).__init__(**kwargs)
        
    def forward(self, x):
        return x - x.mean()

Let's see how it works in practice:

In [15]:
layer = CenteredLayer()
layer(np.array([1,2,3,4,5]))

array([-2., -1.,  0.,  1.,  2.])

We can also use it to construct more complex models:

In [16]:
net = nn.Sequential()
net.add(nn.Dense(128), CenteredLayer())
net.initialize()

Let's see whether the centering layer did its job. For that we send random data through the network and check whether the mean vanishes. Note that since we are dealing with floating point numbers, we are going to see a very small, albeit typically nonzero, number.

In [17]:
y = net(np.random.uniform(size = (4, 8)))

In [18]:
y.mean()

array(1.4551915e-11)
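
One detail worth keeping in mind is that `x.mean()` without an axis argument averages over all elements, so the layer removes a single global mean rather than one mean per feature. The following plain-NumPy sketch (not `mxnet.np`) contrasts the two; the per-feature variant is shown only for comparison and is not what `CenteredLayer` does.

```python
import numpy as np  # plain NumPy, not mxnet.np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0]])

centered_global = x - x.mean()             # same arithmetic as CenteredLayer
centered_per_feature = x - x.mean(axis=0)  # alternative: center each column

print(centered_global.mean())              # overall mean vanishes
print(centered_global.mean(axis=0))        # column means generally do not
print(centered_per_feature.mean(axis=0))   # every column mean vanishes
```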

#### 5.4.2 Layers with Parameters

Now that we know how to define layers in principle, let's define layers with parameters. These can be adjusted through training. In order to simplify things for an avid deep learning researcher the Parameter class and ParameterDict dictionary provide some basic housekeeping functionality. In particular, they govern access, initialization, sharing, saving and loading model parameters. **For instance, this way we do not need to write custom serialization routines for each new custom layer.**

Concretely, we can use the member variable `params` of the `ParameterDict` type that comes with the Block class. It is a dictionary that maps string parameter names to model parameters of the `Parameter` type. We can create a `Parameter` instance from a `ParameterDict` via the `get` function.

In [21]:
params = gluon.ParameterDict()
params.get('param2', shape = (2,3))
params

(
  Parameter param2 (shape=(2, 3), dtype=<class 'numpy.float32'>)
)

Let's use this to implement our own version of the Dense layer. It has two parameters: bias and weight. To make it a bit nonstandard, we bake in the ReLU activation function as the default. The arguments `in_units` and `units` are the number of inputs and the number of outputs, respectively.

In [22]:
class MyDense(nn.Block):
    '''
    units: the number of outputs in this layer
    in_units: the number of inputs in this layer
    '''
    def __init__(self, units, in_units, **kwargs):
        super(MyDense, self).__init__(**kwargs)
        self.weight = self.params.get('weight', shape = (in_units, units))
        self.bias = self.params.get('bias', shape = (units,))
        
        
    def forward(self, x):
        linear = np.dot(x, self.weight.data()) + self.bias.data()
        return npx.relu(linear)

Naming the parameters allows us to access them by name through a dictionary lookup later. It is a good idea to give them instructive names. Next, we instantiate the `MyDense` class and access its model parameters.

In [23]:
dense = MyDense(units = 3, in_units = 5)
dense.params

mydense0_ (
  Parameter mydense0_weight (shape=(5, 3), dtype=<class 'numpy.float32'>)
  Parameter mydense0_bias (shape=(3,), dtype=<class 'numpy.float32'>)
)

We can directly carry out forward calculations using custom layers.

In [24]:
dense.initialize()
dense(np.random.uniform(size = (2,5)))

array([[0.        , 0.        , 0.00402145],
       [0.        , 0.        , 0.        ]])

We can also construct models using custom layers. Once we have that we can use it just like the built-in dense layer. The only exception is that in our case size inference **is not automatic**.

In [27]:
net = nn.Sequential()
net.add(MyDense(8, in_units=64),
        MyDense(1, in_units=8))
net.initialize()
net(np.random.uniform(size = (2, 64)))

array([[0.00722512],
       [0.00719882]])
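
Because `in_units` has to be supplied by hand, it helps to trace the shapes once. The sketch below repeats the arithmetic of `MyDense.forward` in plain NumPy (not `mxnet.np`); `dense_forward` and the random weights are stand-ins introduced here for illustration.

```python
import numpy as np  # plain NumPy, not mxnet.np

def relu(z):
    return np.maximum(z, 0)

def dense_forward(x, weight, bias):
    # Same arithmetic as MyDense.forward: weight has shape (in_units, units)
    return relu(x @ weight + bias)

rng = np.random.default_rng(0)
w1, b1 = rng.normal(scale=0.01, size=(64, 8)), np.zeros(8)
w2, b2 = rng.normal(scale=0.01, size=(8, 1)), np.zeros(1)

x = rng.uniform(size=(2, 64))
h = dense_forward(x, w1, b1)    # (2, 64) @ (64, 8) -> (2, 8)
y = dense_forward(h, w2, b2)    # (2, 8)  @ (8, 1)  -> (2, 1)
print(h.shape, y.shape)

# A wrong in_units is only caught when the matrix product is attempted
try:
    dense_forward(x, rng.normal(size=(32, 8)), b1)
except ValueError as err:
    print('shape mismatch:', type(err).__name__)
```

The `in_units` of each layer must equal the `units` of the layer before it (and the feature dimension of the input for the first layer); nothing infers this for us.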

### 5.5 File I/O

So far we have discussed how to process data and how to build, train, and test deep learning models. However, at some point we are likely to be happy with what we obtained and will want to save the results for later use and distribution. **Likewise, when running a long training process it is best practice to save intermediate results (checkpointing) to ensure that we do not lose several days' worth of computation when tripping over the power cord of our server**. At the same time, we might want to load a pre-trained model (e.g., we might have word embeddings for English and use them for our fancy spam classifier). For all of these cases we need to load and store both individual weight vectors and entire models. This section addresses both issues.

#### 5.5.1 Loading and Saving `ndarray`s

In its simplest form, we can directly use the load and save functions to store and read `ndarrays` separately. This works just as expected.

In [28]:
from mxnet import np, npx
from mxnet.gluon import nn

npx.set_np()

In [29]:
x = np.arange(4)
npx.save('x-file', x)

Then, we read the data from the stored file back into memory.

In [30]:
x2 = npx.load('x-file')
x2

[array([0., 1., 2., 3.])]

It is also possible to store a list of ndarrays and read them back into memory.

In [32]:
y = np.zeros(4)
npx.save('x-files', [x,y])
x2, y2 = npx.load('x-files')
(x2, y2)

(array([0., 1., 2., 3.]), array([0., 0., 0., 0.]))

We can even write and read a dictionary that maps from a string to an ndarray. This is convenient, for instance, when we want to read or write all the weights in a model.

In [33]:
mydict = {'x': x, 'y': y}
npx.save('mydict', mydict)
mydict2 = npx.load('mydict')
mydict2

{'x': array([0., 1., 2., 3.]), 'y': array([0., 0., 0., 0.])}

#### 5.5.2 Gluon Model Parameters

Saving individual weight vectors (or other ndarray tensors) is useful, but it gets very tedious if we want to save (and later load) an entire model. After all, we might have hundreds of parameter groups sprinkled throughout. Writing a script that collects all the terms and matches them to an architecture is quite some work. For this reason Gluon provides built-in functionality to load and save entire networks rather than just single weight vectors. An important detail to note is that this saves model *parameters* and **not the entire model**. That is, if we have a 3-layer MLP we need to specify the *architecture* separately. The reason for this is that the models themselves can contain *arbitrary* code, hence they cannot be serialized quite so easily (there is a way to do this for compiled models). The result is that in order to reinstate a model we need to generate the architecture in code and then load the parameters from disk. Deferred initialization is quite advantageous here since we can simply define a model without the need to put actual values in place. Let's start with our favorite MLP.

In [34]:
class MLP(nn.Block):
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        self.hidden = nn.Dense(256, activation='relu')
        self.output = nn.Dense(10)
        
    def forward(self, x):
        return self.output(self.hidden(x))

In [35]:
net = MLP()
net.initialize()
x = np.random.uniform(size = (2,20))
y = net(x)

Next, we store the parameters of the model as a file with the name `mlp.params`.

In [36]:
net.save_parameters('mlp.params')

To check whether we are able to recover the model, we instantiate a clone of the original MLP model. Unlike the random initialization of model parameters, here we read the parameters stored in the file directly.

In [37]:
clone = MLP()
clone.load_parameters('mlp.params')

Since both instances have the same model parameters, the computation result of the same input `x` should be the same. Let's verify this:

In [38]:
yclone = clone(x)
yclone == y

array([[ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True],
       [ True,  True,  True,  True,  True,  True,  True,  True,  True,
         True]])
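
The split between architecture (kept in code) and parameters (kept in a file) can also be illustrated without Gluon. The sketch below uses plain NumPy's `np.savez` and `np.load` rather than `save_parameters` and `load_parameters`; the parameter names, the file name `mlp_params.npz`, and the helper functions are assumptions made for this illustration.

```python
import os
import tempfile
import numpy as np  # plain NumPy, not mxnet.np

def init_params(rng):
    # Parameters of a 20 -> 256 -> 10 MLP, keyed by name
    return {
        'hidden_weight': rng.normal(scale=0.01, size=(20, 256)),
        'hidden_bias': np.zeros(256),
        'output_weight': rng.normal(scale=0.01, size=(256, 10)),
        'output_bias': np.zeros(10),
    }

def mlp_forward(params, x):
    # The architecture lives here, in code; it is not stored in the file
    h = np.maximum(x @ params['hidden_weight'] + params['hidden_bias'], 0)
    return h @ params['output_weight'] + params['output_bias']

rng = np.random.default_rng(0)
params = init_params(rng)
x = rng.uniform(size=(2, 20))
y = mlp_forward(params, x)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'mlp_params.npz')
    np.savez(path, **params)                  # write only the named arrays
    with np.load(path) as stored:
        clone_params = {name: stored[name] for name in stored.files}

y_clone = mlp_forward(clone_params, x)        # same code, restored parameters
print(np.array_equal(y, y_clone))
```

As with the Gluon version, the file alone is not enough to rebuild the model: `mlp_forward` has to be available in code before the restored arrays become useful.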