## Initialization
Initialization refers to the values we assign to the parameters of the layers when we create the network. Several papers have shown that the initialization method greatly influences both the speed of training and the network's ability to converge.

In [1]:
import mxnet as mx
from mxnet import init, nd
from mxnet.gluon import nn

### Different types of initialization
- By default, MXNet initializes weight matrices by drawing from a uniform distribution between -0.07 and 0.07, and sets all bias parameters to zero.
- However, we often need other methods to initialize weights.
- MXNet supports many initialization schemes, from simple constant initialization (such as one or zero) and random uniform or normal weights to more complex schemes such as Xavier initialization, which takes the number of inputs and outputs of a layer into account to scale the magnitude of the weights.

In [None]:
# Constant initializers
init.One
init.Zero
init.Constant
# Random initializers
init.Uniform
init.Normal
# Random, scaled by the layer's number of inputs and outputs
init.Xavier

Xavier initialization is a very popular initialization scheme for training neural networks.

In [2]:
# Create a Conv2D layer with one input channel, one output channel, and a 3x3 kernel.
layer = nn.Conv2D(channels=1, in_channels=1, kernel_size=(3, 3))
layer.initialize(init.Xavier())
layer.weight.data()


[[[[ 0.05636501  0.10720772  0.24847925]
   [ 0.39752382  0.11866093  0.41332   ]
   [ 0.05182666  0.4009717  -0.08815584]]]]
<NDArray 1x1x3x3 @cpu(0)>

With a large number of input and output channels, Xavier initialization produces weights of much smaller magnitude:

In [3]:
# Increase the number of input and output channels
layer = nn.Conv2D(channels=512, in_channels=512, kernel_size=(3, 3))
layer.initialize(init.Xavier())
layer.weight.data()[0]


[[[ 0.00630558  0.00744513 -0.00590012]
  [-0.00318499 -0.01033202  0.01999258]
  [-0.0226214   0.02366119 -0.01160159]]

 [[-0.0059481  -0.00113977  0.01488703]
  [ 0.01593029  0.00147454 -0.00102179]
  [ 0.00347238 -0.0054713   0.02171864]]

 [[ 0.01715045 -0.02189048 -0.00829784]
  [-0.02106922  0.00756137 -0.02448375]
  [-0.00672377  0.01697394  0.0233291 ]]

 ...

 [[-0.01893761  0.01284147 -0.00975751]
  [-0.02142992 -0.00148998 -0.00093406]
  [ 0.00985651 -0.00277236  0.00710726]]

 [[ 0.00880146 -0.01043105 -0.00261599]
  [-0.00773894  0.01042632 -0.01882804]
  [ 0.00926955  0.00075613  0.01006069]]

 [[ 0.00549796  0.00605232 -0.0150173 ]
  [-0.01781082  0.02513626  0.01330902]
  [-0.01826004  0.01434206  0.0111638 ]]]
<NDArray 512x3x3 @cpu(0)>
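
The drop in magnitude between the two layers follows from how Xavier scales its sampling range. The sketch below is plain Python, not MXNet code. It assumes the default `Xavier` settings (uniform sampling, a factor averaged over fan-in and fan-out, and a magnitude of 3), and the helper name `xavier_uniform_bound` is hypothetical.

```python
import math


def xavier_uniform_bound(out_channels, in_channels, kernel_size, magnitude=3.0):
    """Return b such that weights are drawn from U(-b, b).

    Assumption: uniform sampling with the factor averaged over fan-in and
    fan-out, which is understood to be the default of init.Xavier().
    """
    receptive_field = math.prod(kernel_size)  # e.g. 3 * 3 = 9
    fan_in = in_channels * receptive_field
    fan_out = out_channels * receptive_field
    factor = (fan_in + fan_out) / 2.0
    return math.sqrt(magnitude / factor)


small = xavier_uniform_bound(1, 1, (3, 3))
large = xavier_uniform_bound(512, 512, (3, 3))
print(f"1 -> 1 channels:     +/- {small:.4f}")
print(f"512 -> 512 channels: +/- {large:.4f}")
```

Under these assumptions the bounds come out at roughly 0.577 and 0.0255, which is consistent with the sampled weights printed above: every value of the first layer lies within ±0.577, and every value shown for the second lies within ±0.0255.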

### Deferred Initialization
- Deferred initialization lets you define only the number of outputs of each layer. The number of inputs is inferred automatically the first time data passes through that layer.

In [4]:
# create a layer without defining the number of input channels.
layer = nn.Conv2D(channels=1, kernel_size=(3,3))
layer.weight

Parameter conv2_weight (shape=(1, 0, 3, 3), dtype=<class 'numpy.float32'>)

We can see that the shape of the parameter is (1, 0, 3, 3). The 0 means that we do not yet know how many input channels this convolutional layer will accept, and hence we do not know the depth dimension of the kernel.


If we initialize the layer and run a batch of data of shape 1 by 8 by 224 by 224 through it, the layer detects that the input has 8 channels and initializes the weights with the right shape, according to the initialization rule.

In [5]:
layer.initialize(init.Xavier())
layer(mx.nd.ones((1,8,224,224)))
layer.weight

Parameter conv2_weight (shape=(1, 8, 3, 3), dtype=<class 'numpy.float32'>)
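
The mechanism can be illustrated with a toy layer in plain Python. This is only a sketch of the idea, not how Gluon implements it; the class `LazyDense` and its attributes are hypothetical, and the ±0.07 range simply mirrors the default uniform range mentioned earlier.

```python
import random


class LazyDense:
    """Toy fully connected layer whose input size is inferred on first call."""

    def __init__(self, units):
        self.units = units
        self.shape = (units, 0)  # 0 marks the still-unknown input size
        self.weight = None

    def __call__(self, x):
        if self.weight is None:
            # First forward pass: the input length fixes the missing dimension,
            # and only now can the weights be allocated and initialized.
            in_units = len(x)
            self.shape = (self.units, in_units)
            self.weight = [
                [random.uniform(-0.07, 0.07) for _ in range(in_units)]
                for _ in range(self.units)
            ]
        # Plain matrix-vector product.
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


layer = LazyDense(units=2)
print(layer.shape)  # (2, 0)
out = layer([1.0] * 8)
print(layer.shape)  # (2, 8)
```

As with the `Conv2D` layer above, the unknown dimension is reported as 0 until the first input arrives.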

### Initialization Context
A very important concept in Gluon is that a network lives on a specific compute context. If the network weights are stored as NDArrays on a compute context, then the network is said to be on that context. For example, if you initialize all the weights of a network on one GPU, then your network is on that GPU. You can then run inference by passing in data located on the same compute context (CPU or GPU) as the network weights, and the output will be on that compute context as well. For example, the Conv2D layer below is initialized on the CPU and receives data allocated on the CPU, so its output is also on the CPU, as the `@cpu(0)` suffix of the result shows.

In [7]:
layer = nn.Conv2D(channels=1, kernel_size=(3,3))
layer.initialize(init.Xavier(), ctx=mx.cpu())

In [9]:
layer(nd.uniform(shape=(1,3,9,9), ctx=mx.cpu()))


[[[[-0.40201202 -0.7018695   0.01785277  0.5870792   0.2730119
    -0.28182274 -0.04584998]
   [-0.2916031  -0.528964   -0.29903355  0.07407004  0.0802862
     0.3445815  -0.13694511]
   [-0.59821695 -0.02556936  0.23767231 -0.13543423 -0.5430629
    -0.27945024 -0.20130056]
   [-0.30110553 -0.65429443 -0.60643345 -0.1521534   0.6870324
     0.33310294  0.12388384]
   [-0.5037831   0.38428587  0.40459839  0.13650624 -0.41439134
    -0.16389096 -0.5943488 ]
   [-0.3006251  -0.0619027   0.0304675  -1.034457   -0.04781772
     0.00905298 -0.41354978]
   [-0.48301694 -0.08969672  0.18813546  0.59333855  0.35146442
     0.1964373  -0.61889297]]]]
<NDArray 1x1x7x7 @cpu(0)>

### Set Data

In [None]:
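# Overwrite the layer's weights with ones. The new array must have the weight's shape, (1, 3, 3, 3).
# Note: mx.gpu() requires a GPU; use mx.cpu() on a CPU-only machine.
layer.weight.set_data(nd.ones((1, 3, 3, 3), ctx=mx.gpu()))
layer.weight.data()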