<h1> <center> First report of my master's thesis (Layer normalization) </center> </h1>

During my first month (October 2018), I started working with **layer normalization** (https://arxiv.org/abs/1607.06450) to make my speech-recognition baseline model converge in fewer training steps.

## Problem:

Machine learning models for speech-recognition tasks require a lot of data to be trained, so training and fine-tuning a model takes many days.
To reduce this cost, I proposed applying layer normalization to my model. In the next section, I explain how this technique works.


## Layer normalization explained : 

To understand layer normalization, recall that a mini-batch consists of multiple examples with the same number of features.
Mini-batches are tensors whose **first axis** corresponds to the **batch** and whose remaining axis (or axes) correspond to the features. The key property of layer normalization is that it normalizes over the feature axes rather than over the batch axis, as batch normalization does.
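The difference between the two normalization axes can be illustrated with a small NumPy sketch. This is only a toy illustration, not the code used in the experiments: the function names, the batch shape (4 examples, 8 features) and the `eps` value are assumptions chosen for the example, and the learned gain and bias are left out.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each example over its feature axis (last axis)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch axis (first axis)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.RandomState(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(4, 8))  # hypothetical: 4 examples, 8 features

ln = layer_norm(batch)
bn = batch_norm(batch)

# Layer normalization: statistics are computed per example (per row).
print(ln.mean(axis=1))
# Batch normalization: statistics are computed per feature (per column).
print(bn.mean(axis=0))
# Layer normalization of one example does not depend on the rest of the batch.
print(np.allclose(layer_norm(batch[:1]), ln[:1]))
```

Because the statistics are computed inside each example, the layer-normalized output of an example is the same whatever else is in the mini-batch, which is why the technique is convenient for recurrent networks.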

![Layer Normalization explained](images/LN-explained.png "")

    
    



## Methodology: 

My baseline model is a Listen, Attend and Spell (LAS) encoder/decoder with an attention mechanism in between. The encoder (Listener) is a pyramidal bidirectional LSTM, and the decoder (Speller) is an attention-based recurrent neural network (RNN) that emits characters as outputs.

As a reminder, the basic LSTM equations used for these experiments are:

![LSTM equations](images/lstm-equations.png "LSTM equations")


Layer normalization in our case is applied as follows:

![LSTM equations after applying layer normalization](images/lstm-ln-equations.png "LSTM-LN-equations")
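To make the recurrence concrete, here is a minimal NumPy sketch of a single layer-normalized LSTM step. It follows the LSTM formulation given in the layer normalization paper (normalize the input-to-hidden and hidden-to-hidden projections separately, then normalize the cell state before the output non-linearity), which I assume matches the figure above. The parameter names in the dictionary `p`, the gate ordering and the toy sizes are assumptions made for this illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ln(z, gain, bias, eps=1e-5):
    """Layer normalization over the feature axis with a learned gain and bias."""
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return gain * (z - mean) / np.sqrt(var + eps) + bias

def ln_lstm_step(x, h_prev, c_prev, p):
    """One LSTM step with layer normalization on both projections and on the cell state."""
    gates = (ln(np.dot(h_prev, p["W_h"]), p["g_h"], p["b_h"])
             + ln(np.dot(x, p["W_x"]), p["g_x"], p["b_x"])
             + p["b"])
    f, i, o, g = np.split(gates, 4, axis=1)  # forget, input, output, candidate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(ln(c, p["g_c"], p["b_c"]))
    return h, c

# Toy sizes (hypothetical): batch of 2, 5 input features, 3 hidden units.
rng = np.random.RandomState(0)
batch_size, n_in, n_hid = 2, 5, 3
p = {
    "W_x": rng.normal(scale=0.1, size=(n_in, 4 * n_hid)),
    "W_h": rng.normal(scale=0.1, size=(n_hid, 4 * n_hid)),
    "b": np.zeros(4 * n_hid),
    "g_x": np.ones(4 * n_hid), "b_x": np.zeros(4 * n_hid),
    "g_h": np.ones(4 * n_hid), "b_h": np.zeros(4 * n_hid),
    "g_c": np.ones(n_hid), "b_c": np.zeros(n_hid),
}
x = rng.normal(size=(batch_size, n_in))
h0 = np.zeros((batch_size, n_hid))
c0 = np.zeros((batch_size, n_hid))

h1, c1 = ln_lstm_step(x, h0, c0, p)
print(h1.shape, c1.shape)  # (2, 3) (2, 3)
```

Note that the small `eps` inside `ln` keeps the division well defined even when a projection is constant, as happens here for the all-zero initial hidden state.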


## Experiments : 

TensorFlow already provides an implementation of layer normalization (see: [tf.contrib.rnn.LayerNormBasicLSTMCell](https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LayerNormBasicLSTMCell)), and I used it at first.
I applied layer normalization in both the encoder and the decoder.

Here is the validation loss of this experiment: the blue curve corresponds to the model with layer normalization (LN) applied to the encoder and decoder, and the orange curve corresponds to the model without LN.
Both experiments were run on the 100-hour clean subset of LibriSpeech.




![Layer normalization applied to both encoder and decoder](images/validation_en_de_layer_norm_basic.png "Layer normalization applied to both encoder and decoder")


In a second experiment, I applied layer normalization **only to the encoder**. Here is the result:




![Layer Normalization applied only to the encoder](images/validation_en_layer_norm_basic.png "Layer Normalization applied only to the encoder")

In a third experiment, I applied layer normalization **only to the decoder**. Here is the result:



![Layer normalization applied only to the decoder](images/only_decoder_BLN.png "Layer normalization applied only to the decoder")

## Investigation: 

Surprisingly, the LayerNormBasicLSTMCell failed to beat the BasicLSTMCell. <br>
To investigate why, I tried solving smaller tasks with the LayerNormBasicLSTMCell in order to inspect its behaviour.



### MNIST task using the tf.contrib.rnn.LayerNormBasicLSTMCell

I started by working on the MNIST task. 



![Layer normalization applied to MNIST task]( images/mnist-ln.png "Layer normalization applied to MNIST task")

The results were strange: the loss was **NaN** at every step and the training accuracy stayed very low. <br>
The training ended with a **testing accuracy** of 0.078125.

The same model trained with a plain BasicLSTMCell gave good results and ended with a **testing accuracy** of 0.8828125. <br>
This is not the best result achievable on MNIST, but it is far better than the performance with the LayerNormBasicLSTMCell.


### Deep Recurrent Attentive Writer (DRAW) task using the tf.contrib.rnn.LayerNormBasicLSTMCell

The DRAW task is mentioned by the authors of the layer normalization paper (https://arxiv.org/pdf/1607.06450.pdf).<br>
I tried to reproduce the results of the paper using the tf.contrib.rnn.LayerNormBasicLSTMCell, and here are the results I got:



![Layer normalization applied to DRAW task]( images/draw_ln.png "Layer normalization applied to DRAW task") 


And here is the plot of the DRAW task **without layer normalization**:


![DRAW task without layer Normalization]( images/draw_simple.png "DRAW task without layer Normalization")  

The model with layer normalization performed nearly the same as the baseline, except that it was more stable (fewer fluctuations) between epochs 15 and 100. <br>
**Question:** Why was the model more stable here, and why did it converge, when our LAS model did not?

To answer this question, let's go back to the original DRAW paper and see how the model is designed (see https://arxiv.org/pdf/1502.04623v1.pdf). <br>
The DRAW model is composed of an encoder and a decoder (both RNNs) with an attention mechanism in between. This is close to the design of our LAS model, except that the LAS model has a more complex encoder architecture (pyramidal bidirectional LSTM) and, of course, longer and more complex sequences (audio data vs. MNIST images).

Here is the architecture of the DRAW model:



![DRAW model architecture](images/Draw.png "DRAW model architecture")  


# Proposed solutions :

With all this information in mind, I went back and modified the LAS model to use a simpler listener architecture: instead of a pyramidal bidirectional LSTM, I used a plain bidirectional LSTM to see whether it would make a difference this time.

### Modified LAS : ( Encoder as a Bidirectional LSTM with layer normalization )  

Unfortunately, even with a simpler listener I was not able to get satisfying results.<br>
But this time the model converged further and the validation loss reached 1.10.

![Encoder as a simple Bidirectional LSTM](images/bidirectional_ln.png "Encoder as a simple Bidirectional LSTM")  

### Modified LAS: ( modified Encoder where layer normalization is applied only to the Pyramidal Layers )

Unfortunately, this solution did not give satisfying results either.


![Layer normalization applied only to the Pyramidal layers of the encoder](images/pyramidal_ln.png "Layer normalization applied only to the Pyramidal layers of the encoder")  

## Decision :

At this point, after all these experiments, I decided to stop working with the tf.contrib.rnn.LayerNormBasicLSTMCell and to write my own implementation of layer normalization.


Here is my implementation of layer normalization, inspired by https://github.com/hardmaru/supercell:

In [1]:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

"""Module for constructing RNN Cells."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import tensorflow as tf
from tensorflow.python.ops import rnn_cell_impl
from tensorflow.python.platform import tf_logging as logging


class LN_LSTMCell(rnn_cell_impl.RNNCell):

		"""
		LSTM unit with layer normalization 
		Layer normalization implementation is based on:
		https://arxiv.org/abs/1607.06450.
	
		"""

		def __init__(
				self,
				num_units,
				forget_bias=1.0,
				input_size=None,
				reuse=None,
				use_layer_norm=True,
				use_recurrent_dropout=False, dropout_keep_prob=0.90,
				training= True,
				):
				"""
				Initializes the basic LSTM cell.
				Args:
					num_units: int, The number of units in the LSTM cell.
					forget_bias: float, The bias added to forget gates (see above).
					input_size: Deprecated and unused.
					
				"""

				super(LN_LSTMCell, self).__init__(_reuse=reuse)

				if input_size is not None:
						logging.warn('%s: The input_size parameter is deprecated.',
												 self)

				self._num_units = num_units
				self._forget_bias = forget_bias
				self._reuse = reuse
				self.use_recurrent_dropout = use_recurrent_dropout
				self.dropout_keep_prob = dropout_keep_prob
				self.use_layer_norm = use_layer_norm

		@property
		def state_size(self):
				return rnn_cell_impl.LSTMStateTuple(self._num_units,
								self._num_units)

		@property
		def output_size(self):
				return self._num_units

		def call(self, x, state):
			with tf.variable_scope("ln"):
				h, c = state

				h_size = self._num_units
	  
				batch_size = x.get_shape().as_list()[0]
				x_size = x.get_shape().as_list()[1]
				  
				w_init= None # uniform

				h_init=lstm_ortho_initializer()

				W_xh = tf.get_variable('W_xh',
					[x_size, 4 * self._num_units], initializer=w_init)

				W_hh = tf.get_variable('W_hh_i',
					[self._num_units, 4*self._num_units], initializer=h_init)

				bias = tf.get_variable('bias',
					[4 * self._num_units], initializer=tf.constant_initializer(0.0))

				
				xh = tf.matmul(x,W_xh)
				hh = tf.matmul(h,W_hh)
				ln_xh = raw_layer_norm(xh)
				ln_hh = raw_layer_norm(hh)
				concat = ln_xh + ln_hh + bias 
				#concat = xh + hh + bias
				i, j, f, o = tf.split(concat, 4, 1)
				g = tf.tanh(j) 
				new_c = c * tf.sigmoid(f + self._forget_bias) + tf.sigmoid(i)*g
				new_h = tf.tanh(layer_norm(new_c,self._num_units,scope='ln_c')) * tf.sigmoid(o)
				
				return new_h, tf.contrib.rnn.LSTMStateTuple(new_c, new_h)
		

		

		
def lstm_ortho_initializer(scale=1.0):
    """Initializer that fills each of the four LSTM gate blocks with an orthogonal matrix."""
    def _initializer(shape, dtype=tf.float32, partition_info=None):
        size_x = shape[0]
        size_h = shape[1] // 4  # assumes the four LSTM gates are stacked along axis 1
        t = np.zeros(shape)
        t[:, :size_h] = orthogonal([size_x, size_h]) * scale
        t[:, size_h:size_h * 2] = orthogonal([size_x, size_h]) * scale
        t[:, size_h * 2:size_h * 3] = orthogonal([size_x, size_h]) * scale
        t[:, size_h * 3:] = orthogonal([size_x, size_h]) * scale
        return tf.constant(t, dtype)
    return _initializer

def orthogonal(shape):
	flat_shape = (shape[0], np.prod(shape[1:]))
	a = np.random.normal(0.0, 1.0, flat_shape)
	u, _, v = np.linalg.svd(a, full_matrices=False)
	q = u if u.shape == flat_shape else v
	return q.reshape(shape)


def orthogonal_initializer(scale=1.0):
	def _initializer(shape, dtype=tf.float32, partition_info=None):
		return tf.constant(orthogonal(shape) * scale, dtype)
	return _initializer


def layer_norm(x, num_units, scope="layer_norm", reuse=False, gamma_start=1.0, epsilon = 1e-5, use_bias=True):
  axes = [1]
  mean = tf.reduce_mean(x, axes, keep_dims=True)
  x_shifted = x-mean
  var = tf.reduce_mean(tf.square(x_shifted), axes, keep_dims=True)
  inv_std = tf.rsqrt(var + epsilon)
  with tf.variable_scope(scope):
    if reuse == True:
      tf.get_variable_scope().reuse_variables()
    gamma = tf.get_variable('ln_gamma', [num_units], initializer=tf.constant_initializer(gamma_start))
    if use_bias:
      beta = tf.get_variable('ln_beta', [num_units], initializer=tf.constant_initializer(0.0))
  output = gamma*(x_shifted)*inv_std
  if use_bias:
    output = output + beta
  return output

def raw_layer_norm(x, epsilon=1e-3):
	  axes = [1]
	  mean = tf.reduce_mean(x, axes, keep_dims=True)
	  std = tf.sqrt(
	      tf.reduce_mean(tf.square(x - mean), axes, keep_dims=True) + epsilon)
	  output = (x - mean) / (std)
	  return output
	


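As a quick sanity check of the `orthogonal` helper used by `lstm_ortho_initializer`, the following NumPy-only sketch copies that helper from the cell above and verifies that the matrices it returns have orthonormal columns. The random seed and the shapes (64x64 and 64x32) are arbitrary choices for this illustration, not values taken from the experiments.

```python
import numpy as np

def orthogonal(shape):
    """Copy of the helper above: build an orthogonal matrix from the SVD of a random Gaussian matrix."""
    flat_shape = (shape[0], np.prod(shape[1:]))
    a = np.random.normal(0.0, 1.0, flat_shape)
    u, _, v = np.linalg.svd(a, full_matrices=False)
    q = u if u.shape == flat_shape else v
    return q.reshape(shape)

np.random.seed(0)

# Square case, as used for each gate block of the hidden-to-hidden matrix.
q_square = orthogonal([64, 64])
err_square = np.abs(np.dot(q_square.T, q_square) - np.eye(64)).max()

# Tall case: the columns are still orthonormal.
q_tall = orthogonal([64, 32])
err_tall = np.abs(np.dot(q_tall.T, q_tall) - np.eye(32)).max()

# Both errors should be at the level of floating-point round-off.
print(err_square, err_tall)
```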


Let's now look at the results I obtained with the custom implementation of layer normalization.

As a first attempt, I applied my custom code to the MNIST task and, surprisingly, got the following results: <br>



![Custom layer normalization applied to MNIST task](images/mnist_custom.png "Custom layer normalization applied to MNIST task")  <br>


The experiment ended with a **testing accuracy of 0.9921875**, far better than the 0.078125 I got with the tf.contrib.rnn.LayerNormBasicLSTMCell.


## LAS with my custom layer normalization 

In this experiment, I applied my custom layer normalization to both the encoder and the decoder. Here is the result I got:



![Custom layer Normalization applied to both encoder and decoder](images/Custom_ln.png "Custom layer Normalization applied to both encoder and decoder")  <br>


Although the results were not as good as expected, the validation loss of the LAS model drops sharply, and it does so in fewer steps. <br>
This is the best result I have obtained so far, but I still cannot explain why the model stops converging.

## Investigation: 

At this point, it looks like I am getting into the black magic of tweaking my model.
I am not sure that tuning the hyperparameters would be enough. <br>
Another important issue I found is that the MFCC features were all normalized during the pre-processing phase.
That could be the reason why layer normalization did not work: the model may be trying to normalize an already normalized distribution.

## Audio features processor: 


Inspecting my audio processor, I saw that all my features were normalized. This could explain why layer normalization did not work on my model in the previous experiments.


In [None]:
        # mean and variance normalize the features
        if self.conf['mvn'] == 'True':
            features = (features-np.mean(features, 0))/np.std(features, 0)
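To see what this pre-processing step computes, here is a small NumPy sketch that applies the same expression to a toy feature matrix. The shape (200 frames, 13 coefficients) and the random data are assumptions made only for this illustration; they are not the real MFCC features. The sketch checks which axis ends up with zero mean and unit standard deviation.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical feature matrix: 200 time frames (axis 0), 13 coefficients (axis 1).
features = rng.normal(loc=5.0, scale=3.0, size=(200, 13))

# Same expression as in the audio processor: statistics are taken over axis 0.
normalized = (features - np.mean(features, 0)) / np.std(features, 0)

# Each coefficient now has zero mean and unit standard deviation over time.
per_coefficient_ok = (np.allclose(normalized.mean(axis=0), 0.0)
                      and np.allclose(normalized.std(axis=0), 1.0))

# The statistics of a single frame across its coefficients are not standardized.
per_frame_ok = np.allclose(normalized.mean(axis=1), 0.0)

print(per_coefficient_ok, per_frame_ok)
```

In this toy setting, the `mvn` step standardizes each coefficient over time, while layer normalization computes its statistics over the feature axis of each example. Whether this interaction explains the behaviour of my model still has to be checked by the experiment listed in the TODO.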

## TODO:

1) Re-generate the features without normalizing them, then apply layer normalization to my model.