In [62]:
from parser_transitions import *
from run import *
import numpy as np

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## `minibatch_parse`

### `DummyModel`

The first question is what `DummyModel` is doing. There is a good description of it in the comments, so let's just check that it behaves as described.

In [3]:
sentences = [["right", "arcs", "only"],
             ["right", "arcs", "only", "again"],
             ["left", "arcs", "only"],
             ["left", "arcs", "only", "again"]]

In [21]:
pp = PartialParse(sentence=sentences[0]) 

In [19]:
dm = DummyModel()

In [22]:
while True:
    t = dm.predict([pp])
    print(pp.stack, pp.buffer, t)
    pp.parse_step(t[0])
    if pp.is_empty():
        print(pp.stack, pp.buffer, [])
        break

['ROOT'] ['right', 'arcs', 'only'] ['S']
['ROOT', 'right'] ['arcs', 'only'] ['S']
['ROOT', 'right', 'arcs'] ['only'] ['S']
['ROOT', 'right', 'arcs', 'only'] [] ['RA']
['ROOT', 'right', 'arcs'] [] ['RA']
['ROOT', 'right'] [] ['RA']
['ROOT'] [] []


### `test_minibatch_parse()`

In [25]:
deps = minibatch_parse(sentences=sentences,
                       model=DummyModel(),
                       batch_size=2)

In [26]:
deps

[[('arcs', 'only'), ('right', 'arcs'), ('ROOT', 'right')],
 [('only', 'again'), ('arcs', 'only'), ('right', 'arcs'), ('ROOT', 'right')],
 [('only', 'arcs'), ('only', 'left'), ('only', 'ROOT')],
 [('again', 'only'), ('again', 'arcs'), ('again', 'left'), ('again', 'ROOT')]]

Is this parse correct? Let's look at `["right", "arcs", "only"]`. We should have only right arcs here: `arcs -> only`, `right -> arcs` and `root -> right`. That is exactly what the first entry of `deps` contains.
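To double-check the expected arcs without the assignment code, here is a minimal self-contained sketch of the transition system. The helper `apply_transitions` is hypothetical (it is not part of `parser_transitions`), and it assumes each dependency is stored as a `(head, dependent)` tuple, which is what the output above suggests.

```python
def apply_transitions(sentence, transitions):
    """Replay shift ("S"), left-arc ("LA") and right-arc ("RA") transitions.

    Hypothetical helper for illustration; returns (head, dependent) tuples.
    """
    stack, buffer, deps = ["ROOT"], list(sentence), []
    for t in transitions:
        if t == "S":      # move the first buffer word onto the stack
            stack.append(buffer.pop(0))
        elif t == "LA":   # top of the stack becomes head of the second item
            deps.append((stack[-1], stack[-2]))
            del stack[-2]
        elif t == "RA":   # second item becomes head of the top of the stack
            deps.append((stack[-2], stack[-1]))
            stack.pop()
        else:
            raise ValueError(f"unknown transition: {t!r}")
    return deps


# Transition sequence taken from the DummyModel trace above.
right_deps = apply_transitions(["right", "arcs", "only"],
                               ["S", "S", "S", "RA", "RA", "RA"])
print(right_deps)
```

Replaying `S, S, S, LA, LA, LA` on `["left", "arcs", "only"]` with the same helper gives the left-arc entry of `deps`.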

In [30]:
pdf_sentences = [['I', 'parsed', 'this', 'sentence', 'correctly']]

Let's build a model that can parse this sentence from `pdf`. 

## `ParserModel`

### `Embedding layer`

The first question: what is the shape of our `Embedding` layer, and how do we set pre-computed weights? As usual, the shape is `(vocab_size, embed_size)` (see below).

The proper way to use pre-computed embeddings seems to be `from_pretrained()`. The approach in the provided code looks questionable. In particular, it does not freeze the embeddings during training, as we can see below.

From the documentation:
- `freeze (boolean, optional) – If True, the tensor does not get updated in the learning process. Equivalent to embedding.weight.requires_grad = False. Default: True`
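A small sketch of the `from_pretrained()` route, assuming a random stand-in matrix instead of the real pretrained embeddings (the names `pretrained`, `frozen` and `trainable` are made up for this example):

```python
import numpy as np
import torch
import torch.nn as nn

# Stand-in for the pretrained matrix; the real one in this notebook is (5157, 50).
pretrained = np.random.rand(10, 4).astype(np.float32)
weights = torch.from_numpy(pretrained)

# freeze=True is the default, so the weights are not updated during training.
frozen = nn.Embedding.from_pretrained(weights)

# freeze=False keeps the pretrained values as a starting point but fine-tunes them.
trainable = nn.Embedding.from_pretrained(weights, freeze=False)

print(frozen.weight.requires_grad)     # False
print(trainable.weight.requires_grad)  # True
```

Which variant is appropriate depends on whether the assignment wants the embeddings fine-tuned; the sketch only shows how the flag maps to `requires_grad`.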

In [35]:
debug = True

In [41]:
parser, embeddings, train_data, dev_data, test_data = load_and_preprocess_data(reduced=True)

Loading data...
took 2.84 seconds
Building parser...
took 0.04 seconds
Loading pretrained embeddings...
took 3.00 seconds
Vectorizing data...
took 0.08 seconds
Preprocessing training data...
took 1.50 seconds


In [42]:
embeddings.shape

(5157, 50)

In [47]:
embedding_layer = nn.Embedding(*embeddings.shape)

In [48]:
embedding_layer.weight.shape

torch.Size([5157, 50])

In [50]:
embedding_layer.weight.requires_grad

True

In [51]:
embedding_layer.weight = nn.Parameter(torch.tensor(embeddings))

In [52]:
embedding_layer.weight.requires_grad

True

Next question: how do we feed data into the `Embedding` layer and then into the `Linear` layer? The input is a tensor of integers, where each integer is the index of a token in the vocabulary, as usual. The output of the `Embedding` layer has shape `(batch_size, seq_len, embed_size)`. In our case `seq_len` is just the number of features.

To feed the data into a `Linear` layer we need to flatten it to `2D`. We can do that as `(32, 8 * 50)` or as `(32 * 8, 50)`. Which is better? I have used both, but this assignment recommends the first option, which keeps one row per example.

In [66]:
t = torch.LongTensor(np.arange(256).reshape(32, 8))

In [64]:
t_embed = embedding_layer(t)

In [65]:
t_embed.shape

torch.Size([32, 8, 50])
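The `(32, 8 * 50)` flattening discussed above can be sketched as follows. This is a toy setup: the vocabulary size of 256 and the hidden size of 200 are arbitrary choices for the example, not values from the assignment.

```python
import torch
import torch.nn as nn

batch_size, n_features, embed_size = 32, 8, 50
embedding = nn.Embedding(256, embed_size)  # toy vocabulary of 256 ids

ids = torch.arange(batch_size * n_features).reshape(batch_size, n_features)
embedded = embedding(ids)              # (32, 8, 50)
flat = embedded.view(batch_size, -1)   # (32, 400): one row per example

# The Linear layer then expects n_features * embed_size input features.
hidden = nn.Linear(n_features * embed_size, 200)(flat)
print(embedded.shape, flat.shape, hidden.shape)
```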

### xavier init

So we're supposed to use `Xavier` initialization in this assignment. We have the following questions:

- is it the same as `Glorot` init?
- what is the difference with the default init for a `Linear` layer?
- what is better to use with `relu` activation function?
- what is `He` init?
- what is code in `pytorch`?
- finally, what is math behind this?

It looks like the default init in `pytorch` is not `He` init, so if we use `relu` activation we have to change it ourselves.

#### is it the same as Glorot init?

Yes. This init is based on *Understanding the difficulty of training deep feedforward neural networks - Glorot, X. & Bengio, Y. (2010)*, and the lead author is Xavier Glorot, so `Xavier` init and `Glorot` init are two names for the same scheme.

#### what is the difference with the default init for a `Linear` layer?

In the `pytorch` docs we can see that the values are initialized from $U(-\sqrt{k}, \sqrt{k})$ where $k = 1 / f_{in}$. This is precisely the heuristic that the article uses as its baseline for comparison (`eq. 1`).

`Xavier` init draws from $U(-a, a)$ where $a=\sqrt{\frac{6}{f_{in}+f_{out}}}$, as specified in the `pytorch` docs and in the paper (`eq. 16`). In `pytorch` there is an additional parameter, `gain`, so $a=gain \cdot \sqrt{\frac{6}{f_{in}+f_{out}}}$. For `relu` the recommended value is $gain=\sqrt{2}$. The docs don't explain this choice; presumably it compensates for `relu` zeroing about half of the activations, which is the argument behind `He` init.

That is the `uniform` variant. For the `normal` variant we draw from $\mathcal{N}(0, \sigma^2)$ with $\sigma=gain \cdot \sqrt{\frac{2}{f_{in}+f_{out}}}$. With $gain=\sqrt{2}$ this gives $\sigma=\sqrt{\frac{4}{f_{in}+f_{out}}}$, which coincides with `He` init ($\sigma=\sqrt{\frac{2}{f_{in}}}$) only when $f_{in}=f_{out}$.

In Andrew Ng's lectures we may find a variant of this init: $\sigma=\sqrt{\frac{1}{f_{in}}}$.
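To compare these formulas numerically, here is a small pure-Python sketch. The function names are invented for this example, and the layer sizes `400` and `200` are arbitrary.

```python
import math


def default_linear_bound(fan_in):
    """Bound of U(-sqrt(k), sqrt(k)) with k = 1 / fan_in."""
    return math.sqrt(1.0 / fan_in)


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Bound a of U(-a, a) for Xavier uniform init."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def xavier_normal_std(fan_in, fan_out, gain=1.0):
    """Standard deviation for Xavier normal init."""
    return gain * math.sqrt(2.0 / (fan_in + fan_out))


def he_normal_std(fan_in):
    """Standard deviation for He normal init."""
    return math.sqrt(2.0 / fan_in)


relu_gain = math.sqrt(2.0)
fan_in, fan_out = 400, 200

print("default bound      :", default_linear_bound(fan_in))
print("xavier bound (relu):", xavier_uniform_bound(fan_in, fan_out, relu_gain))
print("xavier std (relu)  :", xavier_normal_std(fan_in, fan_out, relu_gain))
print("he std             :", he_normal_std(fan_in))

# With gain = sqrt(2), Xavier normal matches He normal only for a square layer.
print(math.isclose(xavier_normal_std(300, 300, relu_gain), he_normal_std(300)))
```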

#### what is better to use with relu activation function?

As I understand it's better to use `He` init (for example, see Andrew Ng's lectures `C2W1L11`).

#### `He` normal init

Paper: *Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. He et al. (2015)*.

We have $\sigma=\sqrt{\frac{2}{f_{in}}}$ and we use $\mathcal{N}(0, \sigma^2)$ (`eq. 10` in the article).

#### what is code in pytorch?

`Xavier` `uniform`: `nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('relu'))`

`He` `uniform`: `nn.init.kaiming_uniform_(w)`;
`He` `normal`: `nn.init.kaiming_normal_(w)`
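A quick empirical check of these calls on a toy `nn.Linear` layer (the sizes `400` and `200` are arbitrary, and `nonlinearity='relu'` is passed explicitly here for clarity):

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(400, 200)  # weight shape (200, 400): fan_in=400, fan_out=200

# Xavier uniform with the relu gain: values must stay inside [-a, a].
nn.init.xavier_uniform_(layer.weight, gain=nn.init.calculate_gain('relu'))
xavier_bound = math.sqrt(2.0) * math.sqrt(6.0 / (400 + 200))
xavier_max = layer.weight.abs().max().item()

# He normal: the empirical std should be close to sqrt(2 / fan_in).
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
he_std = layer.weight.std().item()

print(xavier_max, "<=", xavier_bound)
print(he_std, "~", math.sqrt(2.0 / 400))
```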

#### why is that?

Here's the [post](https://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization) that is mentioned in the assignment.

So `Xavier` init draws from $\mathcal{N}(0, \sigma^2)$ where $\sigma^2=\frac{2}{f_{in} + f_{out}}$. This is the formula from the original paper, but another formula is often used: $\sigma^2=\frac{1}{f_{in}}$.

First of all suppose $X \perp Y$ and also $\mathbb{E}(X) = \mathbb{E}(Y) = 0$. Then it's easy to show that $\mathbb{V}(XY) = \mathbb{V}(X)\mathbb{V}(Y)$. 

Let's use the formula $\mathbb{V}(XY) = \mathbb{E}(X^2Y^2)-(\mathbb{E}(XY))^2$. The first term is $\mathbb{E}(X^2)\mathbb{E}(Y^2) = \mathbb{V}(X)\mathbb{V}(Y)$: the expectation of a product factorizes for independent random variables ($X^2 \perp Y^2$), and $\mathbb{E}(X^2) = \mathbb{V}(X)$ because $\mathbb{E}(X) = 0$. The second term is 0 because $\mathbb{E}(XY) = \mathbb{E}(X)\mathbb{E}(Y) = 0$.

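Now consider a single neuron: $Y = W_1X_1 + ... + W_nX_n$, where all the $W_i$ and $X_i$ are assumed to be independent and zero-mean, with the $W_i$ identically distributed and the $X_i$ identically distributed.

In this case the variances of the independent terms add up, so we have: $\mathbb{V}(Y) = n\mathbb{V}(W)\mathbb{V}(X)$.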

Now the **key assumption**: we need $\mathbb{V}(X)=\mathbb{V}(Y)$. Why is that? Probably what we actually need is $\mathbb{V}(X) \sim \mathbb{V}(Y)$. In other words, the signal must neither explode nor shrink as it passes through the layer.

Setting $n\mathbb{V}(W) = 1$ gives: $\mathbb{V}(W) = \frac{1}{n} = \frac{1}{n_{in}}$.
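The variance argument can be checked with a small simulation. This is only a sketch: it assumes Gaussian inputs and weights, and the values of `n`, `samples` and `var_x` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, samples = 100, 20_000
var_x = 1.5

# Zero-mean inputs with variance var_x, one row per simulated neuron evaluation.
x = rng.normal(0.0, np.sqrt(var_x), size=(samples, n))

# V(W) = 1: the output variance is about n times the input variance.
w_plain = rng.normal(0.0, 1.0, size=(samples, n))
y_plain = (w_plain * x).sum(axis=1)

# V(W) = 1 / n: the output variance stays close to the input variance.
w_scaled = rng.normal(0.0, np.sqrt(1.0 / n), size=(samples, n))
y_scaled = (w_scaled * x).sum(axis=1)

print("V(Y) with V(W)=1   :", y_plain.var(), "expected about", n * var_x)
print("V(Y) with V(W)=1/n :", y_scaled.var(), "expected about", var_x)
```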

How do we get the original formula? Well, we need to consider the backprop and take an average ...
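A sketch of that step, following the argument in the Glorot & Bengio paper and assuming the same independence and zero-mean conditions hold for the backpropagated gradients. Here $L$ denotes the loss (a symbol introduced only for this sketch), and each input $X_i$ feeds $n_{out}$ neurons $Y_j$:

$$
\mathbb{V}\left(\frac{\partial L}{\partial X_i}\right) = n_{out}\,\mathbb{V}(W)\,\mathbb{V}\left(\frac{\partial L}{\partial Y_j}\right)
\quad\Rightarrow\quad
\mathbb{V}(W) = \frac{1}{n_{out}}
$$

The forward condition $\mathbb{V}(W) = \frac{1}{n_{in}}$ and the backward condition $\mathbb{V}(W) = \frac{1}{n_{out}}$ can both hold only when $n_{in} = n_{out}$, so as a compromise we replace $n$ by the average of the two:

$$
\mathbb{V}(W) = \frac{1}{(n_{in} + n_{out}) / 2} = \frac{2}{n_{in} + n_{out}}
$$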