# 5. Deep Learning Computation

### 5.1 Layers and Blocks

When we first started talking about neural nets, we introduced linear models with a single output. Here, the entire model consists of just a single neuron. By itself, a single neuron takes some set of inputs, generates a corresponding (scalar) output, and has a set of associated parameters that can be updated to optimize some objective function of interest. Then, once we started thinking about networks with multiple outputs, we leveraged vectorized arithmetic and showed how linear algebra could efficiently express an entire layer of neurons. Layers, too, expect some inputs, generate corresponding outputs, and are described by a set of tunable parameters.

When we worked through softmax regression, a single layer was itself the model. However, when we subsequently introduced multilayer perceptrons, we developed models consisting of multiple layers. One interesting property of multilayer neural networks is that the *entire model* and its *constituent layers* share the same basic structure. The model takes the true inputs (as stated in the problem formulation), outputs predictions of the true outputs, and possesses parameters (the combined set of all parameters from all layers). Likewise any individual constituent layer in a multilayer perceptron ingests inputs (supplied by the previous layer), generates outputs (which form the inputs to the subsequent layer), and possesses a set of tunable parameters that are updated with respect to the ultimate objective (using the signal that flows backwards through the subsequent layers).

While you might think that neurons, layers, and models give us enough abstractions to go about our business, it turns out that we will often want to express our model in terms of components that are larger than an individual layer. For example, when designing models like ResNet-152, which possesses hundreds of layers (152, thus the name), implementing the network one layer at a time can grow tedious. Moreover, this concern is not just hypothetical: such deep networks dominate numerous application areas, especially when training data is abundant.

To facilitate the implementation of networks consisting of components of arbitrary complexity, we introduce a new flexible concept: a neural network *block*. A block could describe a single neuron, a high-dimensional layer, or an arbitrarily complex component consisting of multiple layers. From a software development perspective, a `Block` is a class. Any subclass of `Block` must define a method called `forward` that transforms its input into output, and must store any necessary parameters. Note that some Blocks do not require parameters at all! Finally, a Block must possess a `backward` method, for purposes of calculating gradients. Fortunately, due to some behind-the-scenes magic supplied by the autograd package, defining our own Block typically requires only that we worry about parameters and the `forward` function.

To recap, let's revisit the Blocks that played a role in the implementation of the multilayer perceptron. The following code generates a network with one fully-connected hidden layer containing 256 units followed by a ReLU activation function, and then another fully-connected layer consisting of 10 units (with no activation function, since this is the output layer).

In [1]:
from mxnet import np, npx
from mxnet.gluon import nn

npx.set_np()

In [2]:
x = np.random.uniform(size = (2,20))

In [3]:
net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize()
net(x)

array([[ 0.06240272, -0.03268594,  0.02582652,  0.02254181, -0.03728798,
        -0.04253786,  0.00540613, -0.01364184, -0.09915452, -0.02272737],
       [ 0.02816678, -0.03341203,  0.03565667,  0.02506383, -0.04136416,
        -0.04941843,  0.01738529,  0.01081963, -0.09932576, -0.01176296]])

In this example, our model consists of an object returned by the `nn.Sequential` constructor. **After instantiating an `nn.Sequential` and storing it in the `net` variable, we repeatedly called its `add` method, appending layers in the order that they should be executed**.

**In short, `nn.Sequential` just defines a special kind of `Block`. Specifically, an `nn.Sequential` maintains a list of constituent Blocks, stored in a particular order. You might think of `nn.Sequential` as a type of meta-block. The `add` method simply facilitates the addition of each successive Block to the list**. Note that each of our layers is an instance of the `Dense` class, which is itself a subclass of `Block`. The `forward` function is also remarkably simple: it chains the Blocks in the list together, passing the output of each as the input to the next.

Before we dive into implementing a custom block, let's briefly summarize the basic functionality that each `Block` must perform:

- 1) Ingest input data as arguments to its forward function
- 2) Generate an output via the value returned by its forward function. Note that the output may have a different shape from the input. For example, the first Dense layer in our model above ingests an input of arbitrary dimension but returns an output of dimension 256.
- 3) **Calculate the gradient of its output with respect to its input, which can be accessed via its backward method. Typically this happens automatically**
- 4) Store and provide access to those parameters necessary to execute the forward computation.
- 5) Initialize these parameters as needed.
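
To make these five responsibilities concrete, here is a minimal NumPy sketch of a dense layer written by hand. It is not Gluon's `nn.Dense`: the class name `TinyDense`, the explicit `in_units` argument, and the hand-written `backward` are assumptions made for illustration, whereas Gluon infers the input size and derives gradients automatically.

```python
import numpy as np


class TinyDense:
    """Hypothetical stand-in for a dense layer (not Gluon's nn.Dense)."""

    def __init__(self, in_units, units, rng):
        # (4) and (5): store the parameters and initialize them
        self.weight = rng.uniform(-0.07, 0.07, size=(units, in_units))
        self.bias = np.zeros(units)

    def forward(self, x):
        # (1) and (2): ingest the input and return the output
        return x @ self.weight.T + self.bias

    def backward(self, grad_out):
        # (3): chain the upstream gradient through d(output)/d(input)
        return grad_out @ self.weight


rng = np.random.default_rng(0)
layer = TinyDense(in_units=20, units=256, rng=rng)
x = rng.uniform(size=(2, 20))
y = layer.forward(x)
grad_x = layer.backward(np.ones_like(y))
print(y.shape, grad_x.shape)  # (2, 256) (2, 20)
```

The output has a different shape from the input, while the gradient with respect to the input has the input's shape again.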

#### 5.1.1 A Custom Block

Perhaps the easiest way to develop intuition about how `nn.Block` works is to just dive right in and implement one ourselves. In the following snippet, instead of relying on `nn.Sequential`, we code up a block from scratch that implements a multilayer perceptron with one hidden layer, 256 hidden nodes, and 10 outputs.

Our `MLP` class below inherits from the `Block` class. While we rely on some predefined methods in the parent class, we need to supply our own `__init__` and `forward` functions to uniquely define the behavior of our model.

In [5]:
from mxnet.gluon import nn

In [6]:
class MLP(nn.Block):
    # Declare a layer with model parameters. Here, we declare two fully
    # connected layers
    def __init__(self, **kwargs):
        # Call the constructor of the MLP parent class Block to perform the 
        # necessary initialization. In this way, other function parameters can
        # also be specified when constructing an instance, such as the model parameters,
        # params, described in the following sections
        super(MLP, self).__init__(**kwargs)
        self.hidden = nn.Dense(256, activation='relu')
        self.output = nn.Dense(10)
        
    # Define the forward computation of the model, that is, how to return the
    # required model output based on the input x
    def forward(self, x):
        return self.output(self.hidden(x))

This code may be easiest to understand by working backwards from `forward`. Note that the `forward` method takes `x` as input. It first evaluates `self.hidden(x)` to produce the hidden representation, then passes this output as the input to the output layer `self.output(...)`.

The constituent layers of each MLP **must be instance-level variables**. After all, if we instantiated two such models `net1` and `net2` and trained them on different data, we would expect them to represent two different learned models.
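
The distinction is a property of Python itself, so it can be shown without Gluon. In the plain-Python sketch below, both class names are hypothetical and the "layers" are just strings; the point is only that a class-level attribute is shared by every instance, while an attribute assigned in `__init__` belongs to one instance.

```python
class SharedByMistake:
    # Class-level attribute: a single list shared by all instances
    layers = []

    def __init__(self, name):
        self.layers.append(name)


class PerInstance:
    def __init__(self, name):
        # Instance-level attribute: each object gets its own list
        self.layers = [name]


a, b = SharedByMistake("net1"), SharedByMistake("net2")
c, d = PerInstance("net1"), PerInstance("net2")
print(a.layers is b.layers)  # True: both models would share state
print(c.layers is d.layers)  # False: the models are independent
```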


The `__init__` method is the most natural place to instantiate the layers that we subsequently invoke on each call to the `forward` method. Note that before getting on with the interesting parts, our customized `__init__` method must invoke the parent class's `__init__` method, `super(MLP, self).__init__(**kwargs)`, to save us from reimplementing boilerplate code applicable to most Blocks. Then, all that is left is to instantiate our two Dense layers, assigning them to `self.hidden` and `self.output`, respectively. Again, note that when dealing with standard functionality like this, we do not have to worry about backpropagation, since the `backward` method is generated for us automatically. The same goes for the `initialize` method. Let's try it out:

In [8]:
net = MLP()
net.initialize()
net(x)

array([[-0.03989593, -0.10414708,  0.06799038,  0.05245075,  0.02526059,
        -0.00640342,  0.04182098, -0.01665319, -0.02067345, -0.07863817],
       [-0.03612846, -0.07210436,  0.0915948 ,  0.0789077 ,  0.02494172,
        -0.01028664,  0.01732428, -0.02843242,  0.03772651, -0.06671704]])

As argued earlier, the primary virtue of the Block abstraction is its versatility. We can subclass `Block` to create layers (such as the Dense class provided by Gluon), entire models (such as the MLP class implemented above), or various components of intermediate complexity, a pattern that we will lean on heavily throughout the next chapters on CNNs.

#### 5.1.2 The Sequential Block

As we described earlier, the Sequential class is also just a subclass of Block, **designed specifically for daisy-chaining other Blocks together**. All we need to do to implement our own MySequential block is to define a few convenience functions: 
- 1) An `add` method for appending Blocks one by one to a list.
- 2) A `forward` method to pass inputs through the chain of Blocks (in the order of addition)

The following MySequential class delivers the same functionality as Gluon's default Sequential class:

In [10]:
class MySequential(nn.Block):
    def __init__(self, **kwargs):
        super(MySequential, self).__init__(**kwargs)
        
    def add(self, block):
        # Here, block is an instance of a Block subclass, and we assume it has
        # a unique name. We save it in the member variable _children of the Block
        # class, and its type is OrderedDict. When the MySequential instance calls
        # the initialize function, the system automatically initializes all members of
        # children
        self._children[block.name] = block
        
    def forward(self, x):
        # OrderedDict guarantees that members will be traversed in the order
        # they were added.
        for block in self._children.values():
            x = block(x)
        return x

At its core is the `add` method. It adds any block to the ordered dictionary of children. These are then executed in sequence when forward propagation is invoked. Let's see what the MLP looks like now:

In [28]:
net = MySequential()
net.add(nn.Dense(256,activation = 'relu'))
net.add(nn.Dense(10))
net.initialize()
net(x)

array([[-0.0764568 , -0.01130233,  0.04952145, -0.04651388, -0.04131571,
        -0.05884131, -0.06213811,  0.01311471, -0.01379425, -0.02514282],
       [-0.05124623,  0.00711232, -0.00155934, -0.07555378, -0.06675333,
        -0.01762913,  0.00589084,  0.01447191, -0.04330775,  0.03317727]])

#### 5.1.3 Blocks with Code

Although the Sequential class makes model construction easier, since you do not need to define the `forward` method, directly inheriting from the Block class greatly expands the flexibility of model construction. In particular, we will use Python's control flow within the `forward` method. While we are at it, we need to introduce another concept, that of the *constant* parameter. **These are parameters that are not updated when invoking backpropagation**. This sounds very abstract, but here's what is really going on. Assume that we have some function:

- $f(\textbf{x}, \textbf{w}) = 3 * \textbf{w}^T\textbf{x}$

In this case 3 is a constant parameter. We could change 3 to something else, say c, via

- $f(\textbf{x}, \textbf{w}) = c * \textbf{w}^T\textbf{x}$

Nothing has really changed, except that we can adjust the value of $c$. It is still a constant as far as $\textbf{w}$ and $\textbf{x}$ are concerned. However, since Gluon does not know about this beforehand, it is worthwhile to give it a hand (this makes the code go faster, since we are not sending the Gluon engine on a wild goose chase after a parameter that does not change). `get_constant` is the method that can be used to accomplish this. Let's see what this looks like in practice.

In [29]:
class FancyMLP(nn.Block):
    def __init__(self, **kwargs):
        super(FancyMLP, self).__init__(**kwargs)
        # Random weight parameters created with get_constant are not
        # updated during training
        self.rand_weight = self.params.get_constant(
            'rand_weight', np.random.uniform(size = (20,20)))
        self.dense = nn.Dense(20, activation='relu')
        
    def forward(self, x):
        x = self.dense(x)
        # Use the constant parameters created, as well as the relu
        # and dot functions
        x = npx.relu(np.dot(x, self.rand_weight.data()) + 1)
        # Reuse the fully connected layer. This is equivalent to sharing
        # parameters with two fully connected layers
        x = self.dense(x)
        # Python control flow inside forward: each condition compares a
        # scalar reduction of x against a threshold
        while np.abs(x).sum() > 1:
            x /= 2
        if np.abs(x).sum() < 0.8:
            x *= 10
        return x.sum()

In this `FancyMLP` model, we used the constant weight `rand_weight` (**note that it is not a trainable model parameter**), performed a matrix multiplication operation (`np.dot`), and reused the same Dense layer. Note that this is very different from using two dense layers with different sets of parameters. Instead, we used the same layer twice. **Quite often in deep networks one also says that the parameters are *tied* when one wants to express that multiple parts of a network share the same parameters**. Let's see what happens if we construct our FancyMLP and feed data through it.

In [30]:
net = FancyMLP()
net.initialize()
net(x)

array(5.2637568)
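
The two ideas used in `FancyMLP`, a constant that receives no gradient and a parameter that is reused (tied), can be checked by hand on a scalar toy function. The sketch below is plain Python rather than Gluon code, and the function names `forward` and `grad_w` are chosen here for illustration. It assumes $f(w) = c \cdot w \cdot (w \cdot x)$, where the weight $w$ appears twice and $c$ is held fixed. Treating the two uses as separate weights and summing their gradients gives the gradient of the tied weight.

```python
def forward(w, x, c=3.0):
    h = w * x          # first use of the shared weight
    return c * w * h   # second use of the same weight, scaled by constant c


def grad_w(w, x, c=3.0):
    # Pretend the two uses are separate weights w1, w2: f = c * w2 * (w1 * x)
    g_first = c * w * x   # df/dw1, evaluated at w1 = w2 = w
    g_second = c * w * x  # df/dw2, evaluated at w1 = w2 = w
    # A tied parameter accumulates the gradient from every place it is used.
    # No gradient is computed for c because it is a constant.
    return g_first + g_second


w, x = 0.5, 2.0
eps = 1e-6
# Central finite difference as an independent check of the analytic gradient
numeric = (forward(w + eps, x) - forward(w - eps, x)) / (2 * eps)
print(grad_w(w, x), numeric)
```

The analytic value is $2cwx = 6$, and the finite-difference estimate should agree with it up to rounding error.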

There is no reason why we couldn't mix and match these ways of building a network. Admittedly, the next example looks more like a chimera or, less charitably, a Rube Goldberg machine. That said, it shows how to build a block from individual blocks, which in turn may be blocks themselves. Furthermore, we can even combine multiple strategies inside the same `forward` function. To demonstrate this, here's the network.

In [31]:
class NestMLP(nn.Block):
    def __init__(self, **kwargs):
        super(NestMLP, self).__init__(**kwargs)
        self.net = nn.Sequential()
        self.net.add(nn.Dense(64, activation = 'relu'),
                     nn.Dense(32, activation = 'relu'))
        self.dense = nn.Dense(16, activation = 'relu')
        
    def forward(self, x):
        return self.dense(self.net(x))

In [32]:
chimera = nn.Sequential()
chimera.add(NestMLP(), nn.Dense(20), FancyMLP())

In [33]:
chimera.initialize()
chimera(x)

array(0.9772054)

#### Summary:

- 1) Layers are blocks
- 2) Many layers can be a block
- 3) Many blocks can be a block
- 4) Code can be a block
- 5) Blocks take care of a lot of housekeeping, such as parameter initialization, backprop and related issues
- 6) Sequential concatenations of layers and blocks are handled by the eponymous Sequential block.

### 5.2 Parameter Management

The ultimate goal of training deep networks is to find good parameter values for a given architecture. When everything is standard, the nn.Sequential class is a perfectly good tool for it. However, very few models are entirely standard and most scientists want to build things that are novel. This section shows how to manipulate parameters. In particular, the following aspects will be covered:

- Accessing parameters for debugging, diagnostics, to visualize them or to save them is the first step to understanding how to work with custom models.
- Second, we want to set them in specific ways, e.g., for initialization purposes. We discuss the structure of parameter initializers.
- Last, we show how this knowledge can be put to good use by building networks that share some parameters.


Let's start with our trusty Multilayer Perceptron with a hidden layer. This will serve as our choice for demonstrating the features discussed above.

In [1]:
from mxnet import init, np, npx
from mxnet.gluon import nn

npx.set_np()

In [2]:
net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
net.initialize() # using the default initialization method

In [3]:
x = np.random.uniform(size = (2,20))
net(x) # forward pass

array([[ 0.06240272, -0.03268594,  0.02582652,  0.02254181, -0.03728798,
        -0.04253786,  0.00540613, -0.01364184, -0.09915452, -0.02272737],
       [ 0.02816678, -0.03341203,  0.03565667,  0.02506383, -0.04136416,
        -0.04941843,  0.01738529,  0.01081963, -0.09932576, -0.01176296]])

#### 5.2.1 Parameter Access

In the case of a Sequential class we can access the parameters with ease, simply by indexing each of the layers in the network. The `params` variable then contains the required data. Let's try this out in practice by inspecting the parameters of both layers.

In [4]:
print(net[0].params)
print(net[1].params)

dense0_ (
  Parameter dense0_weight (shape=(256, 20), dtype=float32)
  Parameter dense0_bias (shape=(256,), dtype=float32)
)
dense1_ (
  Parameter dense1_weight (shape=(10, 256), dtype=float32)
  Parameter dense1_bias (shape=(10,), dtype=float32)
)


**Note**: I couldn't wrap my head around the shapes of the weights presented above, so I took a deeper dive to understand the forward pass correctly. Why does `dense0_weight` have the shape (256, 20)? Because we specified 256 hidden units in `nn.Dense`, and the input has 20 features. So there are 256 nodes, each connected to the 20 features passed in from the input layer. Next, why does `dense1_weight` have the shape (10, 256)? Because we specified that we want 10 outputs, and the 256 comes from the previous layer, which had 256 hidden units.

The output tells us a number of things. First, the layer consists of two sets of parameters: `dense0_weight` and `dense0_bias`, as we would expect. They are both single precision and they have the necessary shapes that we would expect from the first layer, given that the input dimension is 20 and the output dimension is 256. In particular the names of the parameters are very useful since they allow us to identify parameters *uniquely* even in a network of hundreds of layers and with nontrivial structure. The second layer is structured accordingly.
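
The shapes can be verified with a short NumPy sketch of the same forward pass. It assumes each dense layer computes $\mathbf{x}\mathbf{W}^\top + \mathbf{b}$, which is consistent with weights stored as (output units, input units); the random values are placeholders and not the parameters of `net`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=(2, 20))  # 2 examples, 20 features

# Placeholder parameters with the same shapes as dense0 and dense1
w0, b0 = rng.uniform(-0.07, 0.07, size=(256, 20)), np.zeros(256)
w1, b1 = rng.uniform(-0.07, 0.07, size=(10, 256)), np.zeros(10)

hidden = np.maximum(x @ w0.T + b0, 0)  # (2, 20) @ (20, 256) -> (2, 256), then ReLU
out = hidden @ w1.T + b1               # (2, 256) @ (256, 10) -> (2, 10)
print(hidden.shape, out.shape)  # (2, 256) (2, 10)
```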

#### Targeted Parameters

In order to do something useful with the parameters we need to access them, though. There are several ways to do this, ranging from simple to general. Let's look at some of them.

In [5]:
print(net[1].bias)
print(net[1].bias.data())

Parameter dense1_bias (shape=(10,), dtype=float32)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


The first line returns the bias of the second layer. Since this is an object containing data, gradients, and additional information, **we need to request the data explicitly**. Note that the bias is all 0 since we initialized the bias to contain all zeros. Note that we can also access the parameters by name, such as `'dense0_weight'`. This is possible since each layer comes with its own parameter dictionary that can be accessed directly. Both methods are entirely equivalent, but the first method leads to much more readable code.

In [8]:
print(net[0].params['dense0_weight'])
print(net[0].params['dense0_weight'].data())

Parameter dense0_weight (shape=(256, 20), dtype=float32)
[[ 0.06700657 -0.00369488  0.0418822  ... -0.05517294 -0.01194733
  -0.00369594]
 [-0.03296221 -0.04391347  0.03839272 ...  0.05636378  0.02545484
  -0.007007  ]
 [-0.0196689   0.01582889 -0.00881553 ...  0.01509629 -0.01908049
  -0.02449339]
 ...
 [-0.02055008 -0.02618652  0.06762936 ... -0.02315108 -0.06794678
  -0.04618235]
 [ 0.02802853  0.06672969  0.05018687 ... -0.02206502 -0.01315478
  -0.03791244]
 [-0.00638592  0.00914261  0.06667828 ... -0.00800052  0.03406764
  -0.03954004]]


Note that the weights are nonzero. This is by design since they were randomly initialized when we initialized the network. `data` is not the only function that we can invoke. For instance, we can access the gradient with respect to the parameters. It has the same shape as the weight. However, since we have not invoked backpropagation yet, the values are all 0.

In [9]:
net[0].weight.grad()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

#### All Parameters at Once

Accessing parameters as described above can be a bit tedious, in particular if we have more complex blocks, or blocks of blocks (or even blocks of blocks of blocks), since we need to walk through the entire tree in reverse order to how the blocks were constructed. To avoid this, blocks come with a method `collect_params()` which grabs all parameters of a network in one dictionary such that we can traverse it with ease. It does so by iterating over all constituents of a block and calling `collect_params()` on subblocks as needed. To see the difference consider the following:

In [10]:
# parameters only for the first layer
print(net[0].collect_params())
# parameters of the entire network
print(net.collect_params())

dense0_ (
  Parameter dense0_weight (shape=(256, 20), dtype=float32)
  Parameter dense0_bias (shape=(256,), dtype=float32)
)
sequential0_ (
  Parameter dense0_weight (shape=(256, 20), dtype=float32)
  Parameter dense0_bias (shape=(256,), dtype=float32)
  Parameter dense1_weight (shape=(10, 256), dtype=float32)
  Parameter dense1_bias (shape=(10,), dtype=float32)
)


This provides us with a third way of accessing the parameters of the network. If we wanted to get the value of the bias term of the second layer we could simply use this:

In [12]:
net.collect_params()['dense1_bias'].data()

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
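
The recursive gathering described above can be sketched in plain Python. This is an illustration of the idea only, not Gluon's implementation: the function `collect` and the nested-dict representation of a block (with `"params"` and `"children"` keys) are hypothetical, and shapes stand in for the actual parameter objects.

```python
def collect(block, out=None):
    """Gather the parameters of a nested block tree into one flat dict."""
    if out is None:
        out = {}
    # Parameters owned directly by this block
    out.update(block.get("params", {}))
    # Recurse into sub-blocks in the order they were added
    for child in block.get("children", []):
        collect(child, out)
    return out


tree = {
    "children": [
        {"params": {"dense0_weight": (256, 20), "dense0_bias": (256,)}},
        {"params": {"dense1_weight": (10, 256), "dense1_bias": (10,)}},
    ],
}
flat = collect(tree)
print(list(flat))
# ['dense0_weight', 'dense0_bias', 'dense1_weight', 'dense1_bias']
print(flat["dense1_bias"])  # (10,)
```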

Throughout, we will see how various blocks name their subblocks (Sequential simply numbers them). This naming makes it very convenient to use regular expressions to filter out the required parameters.

In [14]:
print(net.collect_params('.*weight'))
print(net.collect_params('dense0.*'))

sequential0_ (
  Parameter dense0_weight (shape=(256, 20), dtype=float32)
  Parameter dense1_weight (shape=(10, 256), dtype=float32)
)
sequential0_ (
  Parameter dense0_weight (shape=(256, 20), dtype=float32)
  Parameter dense0_bias (shape=(256,), dtype=float32)
)
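
The same selection can be reproduced with Python's `re` module on a plain list of names. The helper `select` is hypothetical, and the sketch assumes the pattern is matched from the start of each name, as `re.match` does; under that assumption it returns the same names as the two calls above.

```python
import re

names = ["dense0_weight", "dense0_bias", "dense1_weight", "dense1_bias"]


def select(pattern, names):
    """Keep the names that the pattern matches from their first character."""
    regex = re.compile(pattern)
    return [name for name in names if regex.match(name)]


print(select(".*weight", names))  # ['dense0_weight', 'dense1_weight']
print(select("dense0.*", names))  # ['dense0_weight', 'dense0_bias']
```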


#### Rube Goldberg Strikes Again

Let's see how the parameter naming conventions work if we nest multiple blocks inside each other. For that we first define a function that produces blocks (a block factory, so to speak) and then we combine these inside yet larger blocks.

In [17]:
def block1():
    net = nn.Sequential()
    net.add(nn.Dense(32, activation='relu'))
    net.add(nn.Dense(16, activation='relu'))
    return net


def block2():
    net = nn.Sequential()
    for i in range(4):
        net.add(block1())
    return net

In [18]:
rgnet = nn.Sequential()
rgnet.add(block2())
rgnet.add(nn.Dense(10))
rgnet.initialize()
rgnet(x)

array([[-4.1923052e-09,  1.9830513e-09,  8.9443941e-10,  6.2912981e-09,
        -3.3241803e-09,  5.4330047e-09,  1.6013488e-09, -3.7408676e-09,
         8.5468486e-09, -6.4805539e-09],
       [-3.7507055e-09,  1.4866971e-09,  6.8314709e-10,  5.6925775e-09,
        -2.6349158e-09,  4.8626658e-09,  1.4280280e-09, -3.4603027e-09,
         7.4127913e-09, -5.7896128e-09]])

In [19]:
print(rgnet.collect_params)
print(rgnet.collect_params())

<bound method Block.collect_params of Sequential(
  (0): Sequential(
    (0): Sequential(
      (0): Dense(20 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
    (1): Sequential(
      (0): Dense(16 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
    (2): Sequential(
      (0): Dense(16 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
    (3): Sequential(
      (0): Dense(16 -> 32, Activation(relu))
      (1): Dense(32 -> 16, Activation(relu))
    )
  )
  (1): Dense(16 -> 10, linear)
)>
sequential3_ (
  Parameter dense2_weight (shape=(32, 20), dtype=float32)
  Parameter dense2_bias (shape=(32,), dtype=float32)
  Parameter dense3_weight (shape=(16, 32), dtype=float32)
  Parameter dense3_bias (shape=(16,), dtype=float32)
  Parameter dense4_weight (shape=(32, 16), dtype=float32)
  Parameter dense4_bias (shape=(32,), dtype=float32)
  Parameter dense5_weight (shape=(16, 32), dtype=float32)
  Parameter dense5_

Since the layers are hierarchically generated, we can also access them accordingly. For instance, to access the first major block, then its second subblock, and then the bias of that subblock's first layer, we perform the following:

In [22]:
rgnet[0][1][0].bias.data()

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

#### 5.2.2 Parameter Initialization

Now that we know how to access the parameters, let's look at how to initialize them properly. We discussed the need for initialization in previous sections. By default, MXNet initializes the weight matrices **uniformly by drawing from** $U[-0.07, 0.07]$ and sets all bias parameters to 0. However, we often need to use other methods to initialize the weights. MXNet's `init` module provides a variety of preset initialization methods, but if we want something out of the ordinary, we need a bit of extra work.
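
As a rough picture of what the default scheme described above produces, the following NumPy sketch draws a weight matrix uniformly from the stated range and sets the bias to zero. It only mimics the description and does not call MXNet; the shapes are borrowed from the first layer of the earlier network.

```python
import numpy as np

rng = np.random.default_rng(42)

# Weights drawn uniformly from [-0.07, 0.07); bias set to zero
weight = rng.uniform(-0.07, 0.07, size=(256, 20))
bias = np.zeros(256)

print(weight.min() >= -0.07, weight.max() <= 0.07)  # True True
print(bias.sum())  # 0.0
```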

#### Built-in Initialization