###  <center>In The Name of God</center>

# Introduction

<div style="text-align: justify">In this tutorial we cover some important rules for programming with **TensorFlow** in Python. The tutorial also includes the basic background needed to understand the logic behind the rules.</div>

### Static Graph **vs.** Eager Execution
<div style="text-align: justify">TensorFlow originally worked with a static graph: it uses a dataflow graph to represent your computation in terms of the dependencies between individual operations. This leads to a low-level programming model in which you first define the dataflow graph, then create a TensorFlow session to run parts of the graph across a set of local and remote devices.</div>

<img src=files/tensors_flowing.gif>

<div style="text-align: justify">On the other hand, TensorFlow's eager execution is an imperative programming environment that evaluates operations immediately, without building graphs: operations return concrete values instead of constructing a computational graph to run later. This makes it easy to get started with TensorFlow, to debug models, and to build complex models.</div>

<div style="text-align: justify">Eager execution was introduced during the TensorFlow 1.x series and is the default mode starting with TensorFlow 2. The code in this tutorial uses the TensorFlow 1.x API.</div>
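
To make the contrast concrete without depending on a particular TensorFlow version, here is a toy sketch in plain Python. The names `Node`, `constant`, `add`, `mul` and `run` are hypothetical and invented for this illustration; they are not TensorFlow APIs. The sketch only mimics the idea that a graph is built first and evaluated later, whereas eager code returns values immediately.

```python
class Node:
    """A toy graph node: stores an operation and its inputs, computes nothing yet."""

    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs


def constant(value):
    # A node with no inputs that simply yields its value when run.
    return Node(lambda: value)


def add(a, b):
    return Node(lambda x, y: x + y, a, b)


def mul(a, b):
    return Node(lambda x, y: x * y, a, b)


def run(node):
    """Plays the role of a session: evaluates a node after evaluating its inputs."""
    return node.op(*(run(i) for i in node.inputs))


# Graph style: build first, nothing is computed here.
graph_result = add(mul(constant(2), constant(3)), constant(4))
# Only running the graph produces a concrete value.
graph_value = run(graph_result)

# Eager style: every operation returns a concrete value immediately.
eager_value = 2 * 3 + 4

print(isinstance(graph_result, Node), graph_value, eager_value)
```

In the graph style, `graph_result` is only a description of the computation until `run` is called, which is why errors in graph code tend to surface late, at run time of the session.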


In [None]:
import numpy as np
import tensorflow as tf

# session mode
print(tf.executing_eagerly())


 <span style="color:red">**Now restart the kernel**</span>

In [None]:
import numpy as np
import tensorflow as tf


tf.enable_eager_execution()

# eager mode
print(tf.executing_eagerly())

We can still use graphs and sessions in eager execution mode. The only difference is that we do not have a default graph, so we create one explicitly:

In [None]:
print(tf.executing_eagerly())

g = tf.Graph()
with g.as_default():
    print(tf.executing_eagerly())

There are many differences between session mode and eager mode in TensorFlow. Some of them are as follows:
* In session mode most Python functions execute only once (or a few times), because they merely build the graph. In contrast, in eager mode they execute many times (once for each batch).
* In eager mode, variables have no special status and are not tracked in global collections. You can therefore differentiate a loss with respect to any watched tensor using [tf.GradientTape](https://www.tensorflow.org/api_docs/python/tf/GradientTape) and update it yourself.

In [None]:
v = tf.Variable(5)
print(tf.global_variables())

with g.as_default():
    v = tf.Variable(5)
    print(tf.global_variables())

* In eager execution tensors are ordinary Python objects, so they can easily be converted to other formats, deleted, and so on.

In [None]:
x = tf.constant(1)
x = x + 1  # previous x was deleted

with g.as_default():
    x = tf.constant(1)  # x points to first Tensor
    x = x + 1  # now x points to second Tensor

* ...


Read more about [graphs and sessions](https://www.tensorflow.org/guide/graphs) and [eager execution](https://www.tensorflow.org/guide/eager).

----------

# Methods Standards

When writing a method, there are two standards to take into consideration:
* method information (including description, args, returns, raises)
* name scope

### Method Information
Method information should be complete and reproducible. It must include:
* Method description
* Input arguments, including tensors shape and types if they are important 
* Returns, including shape, type, ...
* Raises: if method checks some exception(s) explicitly
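
As a minimal sketch of such a docstring, here is a plain-Python function documented in the same Args / Returns / Raises layout. The function `weighted_sum` is a hypothetical example made up for illustration; it is not part of TensorFlow.

```python
def weighted_sum(values, weights):
    """
    Returns the sum of element-wise products of values and weights.

    Args:
        values: a list of numbers
        weights: a list of numbers with the same length as values

    Returns:
        a single number, the sum of values[i] * weights[i]

    Raises:
        ValueError: if values and weights have different lengths
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights))


print(weighted_sum([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

The Raises entry is listed only because the function checks the condition explicitly; errors that may bubble up from elsewhere are not documented.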

### Name Scope
<div style="text-align: justify">In static graph programming, names play an important role in the low-level APIs, because variables and operations can be retrieved by name. Handling names correctly therefore leads to better low-level code. Name scopes also make a model modular: we can fetch all nodes under a specific name scope, or restore, save, or update all variables that share a scope.</div>

When defining methods we should adhere to three rules:
* Choose a name for the method.
* Put all operations under that name scope.
* Give the last tensor (the one that is returned) that same name.


In [None]:
def tensordot(A, B, name=None):
    """

    Returns the sum of element wise multiplication of A and B tensors

    Args:
        A: an arbitrary Tensor
        B: a Tensor with same shape as A
        name: (Optional)
        
    Returns:
        a scalar which is sum of element wise multiplication of A and B tensors
        
    """
    if name is None:
        name = "tensordot"
    with tf.name_scope(name):
        C = A * B  # created under name scope <name>
    return tf.reduce_sum(C, name=name)  # the operation name will be <name>


------

# Old Fashioned Models
Now assume that we are working in session mode. We want to see the difference between **tf.layers.dense** and **tf.layers.Dense**.

In [None]:
with tf.Graph().as_default():
    x = tf.random_normal([10, 20])
    y = tf.layers.dense(x, 15)
    print(tf.global_variables())

with tf.Graph().as_default():
    x = tf.random_normal([10, 20])
    dense = tf.layers.Dense(15)
    print(tf.global_variables())
    y = dense(x)
    print(tf.global_variables())


As we can see, there are three important points:

- When we use **tf.layers.dense**, it automatically creates variables that are only accessible by name or through the graph. As a result, the layer can be used only once in the graph, and it is of little use in eager execution.
- **tf.layers.Dense** (like any layer object) creates its variables not when it is constructed, but when it is called for the first time. This is convenient, since in many cases we do not know the input size when we create a layer.
- If we call a layer a second time or more, it does not create new variables.

<div style="text-align: justify">The key here is __*tf.get_variable()*__. Unlike **tf.Variable()**, it does not always create a new variable: it creates the variable only if no variable with that name exists yet, and otherwise returns the existing one. For more details see [here](https://www.tensorflow.org/api_docs/python/tf/get_variable).</div>

In [None]:
def foo():
    with tf.variable_scope("foo", reuse=tf.AUTO_REUSE):
        v = tf.get_variable("v", [1])
    return v

with tf.Graph().as_default():
    v1 = foo()  # Creates v.
    v2 = foo()  # Gets the same, existing v.
    print(v1)
    print(v2)
    print(v1 is v2)
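
The get-or-create behaviour of `tf.get_variable` described above can be modelled in a few lines of plain Python. This is only a sketch of the idea: the class `VariableStore` and its dictionary are hypothetical stand-ins for the graph's variable store, not the real TensorFlow implementation.

```python
class VariableStore:
    """Toy stand-in for a graph's variable store, keyed by full variable name."""

    def __init__(self):
        self._vars = {}

    def get_variable(self, scope, name, shape):
        full_name = scope + "/" + name
        if full_name not in self._vars:
            # First request: create the variable (here just a list of zeros).
            self._vars[full_name] = [0.0] * shape[0]
        # Later requests: return the very same object.
        return self._vars[full_name]


store = VariableStore()
v1 = store.get_variable("foo", "v", [1])  # creates foo/v
v2 = store.get_variable("foo", "v", [1])  # fetches the existing foo/v
print(v1 is v2)
```

Because the lookup is keyed by the scoped name, two calls made under different scopes would yield two distinct variables.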

This trick can be used to create layers, models and so on...

In [None]:
class Dense(object):

    def __init__(self, num_units, activation=None, trainable=True, name=None):
        if name is None:
            name = "dense"
        self.name = str(name)
        self.variable_scope = tf.variable_scope(self.name, reuse=tf.AUTO_REUSE)
        self.num_units = num_units
        self.trainable = trainable
        self.activation = activation

    def __call__(self, inputs, name=None):
        if name is None:
            name = "fully-connected"
        with tf.variable_scope(self.variable_scope):
            kernel = tf.get_variable("kernel", shape=[inputs.shape[-1], self.num_units], trainable=self.trainable)
            bias = tf.get_variable("bias", shape=[self.num_units], trainable=self.trainable)
            with tf.name_scope(name):
                output = tf.matmul(inputs, kernel) + bias
                if self.activation is not None:
                    output = self.activation(output)
        return tf.identity(output, name=name)


<div style="text-align: justify">This approach also works when we save and load a session during training: instead of initializing the variables, we load a session, so the variables of the model or layer already exist with their last saved weights.</div>

<span style="color:brown">However, this method fails when we are using eager execution:</span>

In [None]:
def foo():
    with tf.variable_scope("foo", reuse=tf.AUTO_REUSE):
        v = tf.get_variable("v", [1])
    return v


v1 = foo()  # Creates v.
v2 = foo()  # Creating new v!
print(v1)
print(v2)
print(v1 is v2)

So if we want to use eager execution for debugging, we should use two distinct methods, **build** and **call**, and run **build** only when the layer is called for the first time:

In [None]:
class Dense(object):
    def __init__(self, num_units, activation=None, trainable=True, name=None):
        if name is None:
            name = "dense"
        self.name = str(name)
        self.variable_scope = tf.variable_scope(self.name, reuse=tf.AUTO_REUSE)
        self.num_units = num_units
        self.trainable = trainable
        self.activation = activation
        self.kernel = None
        self.bias = None
        self.built = False

    def build(self, inputs_shape):
        with tf.variable_scope(self.variable_scope):
            self.kernel = tf.get_variable("kernel", shape=[inputs_shape[-1], self.num_units], trainable=self.trainable)
            self.bias = tf.get_variable("bias", shape=[self.num_units], trainable=self.trainable)
        self.built = True

    def call(self, inputs, name=None):
        if name is None:
            name = "fully-connected"
        with tf.name_scope(name):
            output = tf.matmul(inputs, self.kernel) + self.bias
            if self.activation is not None:
                output = self.activation(output)
        return tf.identity(output, name=name)

    def __call__(self, inputs, name=None):
        if not self.built:
            self.build(inputs.shape)
        return self.call(inputs, name)
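
The deferred-build pattern can be illustrated without TensorFlow. The sketch below uses a hypothetical `LazyScale` class (invented for this example) whose weights are plain Python lists; the `build_calls` counter exists only to show that `build` runs exactly once.

```python
class LazyScale:
    """Toy layer: multiplies each input element by a weight created on first call."""

    def __init__(self):
        self.weights = None
        self.built = False
        self.build_calls = 0  # counter used only to show that build runs once

    def build(self, input_size):
        # The weight size depends on the input, so it cannot be created in __init__.
        self.weights = [2.0] * input_size
        self.built = True
        self.build_calls += 1

    def call(self, inputs):
        return [x * w for x, w in zip(inputs, self.weights)]

    def __call__(self, inputs):
        if not self.built:
            self.build(len(inputs))
        return self.call(inputs)


layer = LazyScale()
first = layer([1.0, 2.0, 3.0])
second = layer([4.0, 5.0, 6.0])
print(first, second, layer.build_calls)
```

Since the weights live on the object rather than in a name-keyed store, the second call reuses them whether or not a graph exists, which is what makes the pattern work in eager mode.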


#### Handling Trainability
TensorFlow does not provide an easy way to freeze and unfreeze variables as PyTorch does. The only way is to pass the trainable variables through **var_list** in the optimizers:

In [None]:
with tf.Graph().as_default():
    v = tf.Variable(4.0)
    w = tf.Variable(10.0)
    init = tf.variables_initializer([v, w])
    loss = v*v + w*w
    update_v = tf.train.GradientDescentOptimizer(0.05).minimize(loss, var_list=[v])  # update v
    update_w = tf.train.GradientDescentOptimizer(0.05).minimize(loss, var_list=[w])  # update w
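
The effect of **var_list** can be checked by hand on the same loss. The following is a plain-Python sketch under the assumption of vanilla gradient descent; `gradients` and `minimize_step` are hypothetical helper names, not TensorFlow functions.

```python
def gradients(v, w):
    # loss = v*v + w*w, so d(loss)/dv = 2v and d(loss)/dw = 2w
    return {"v": 2.0 * v, "w": 2.0 * w}


def minimize_step(params, var_list, learning_rate=0.05):
    """One gradient descent step that only touches the names in var_list."""
    grads = gradients(params["v"], params["w"])
    updated = dict(params)
    for name in var_list:
        updated[name] = params[name] - learning_rate * grads[name]
    return updated


params = {"v": 4.0, "w": 10.0}
after_v = minimize_step(params, var_list=["v"])  # w is frozen
after_w = minimize_step(params, var_list=["w"])  # v is frozen
print(after_v, after_w)
```

With a learning rate of 0.05, one step on `v` moves it from 4.0 to about 3.6 while `w` stays at 10.0, and one step on `w` moves it from 10.0 to 9.0 while `v` stays at 4.0.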

So in order to get the trainable variables of a complex model, we should either use scopes or use Python methods to collect the variables. A full example is given below:

In [None]:
# utils
def mask_length(length_tensor, max_length, name=None):
    """
    A method which gives the boolean mask tensor related to length_tensor. for length_tensor=[2, 1, 0, 1] and
    max_length=3 it returns [[True, True, False], [True, False, False], [False, False, False], [True, False, False]]

    Args:
        length_tensor: a non-negative integer Tensor whose elements do not exceed max_length
        max_length: a scalar, the size of the new last dimension
        name: (Optional)

    Returns:
         a Tensor of shape [...length_tensor shape..., max_length] of type tf.bool

"""
    if name is None:
        name = "mask_length"  # default scope name matches the function name
    with tf.name_scope(name):
        length_tensor = tf.expand_dims(length_tensor, -1)
        ranges = tf.zeros_like(length_tensor, dtype=length_tensor.dtype) + tf.range(max_length, dtype=length_tensor.dtype)
    return tf.less(ranges, length_tensor, name=name)


# attention mechanisms
def simple_dot_attention(query, key, value, memory_length=None, memory_mask=None, name=None):
    """
    Attention method for given query, keys and values

    Args:
        query: a Tensor of shape [batch_size, query_dim]
        key: a Tensor of shape [batch_size, seq_length, query_dim]
        value: a Tensor of shape [batch_size, seq_length, value_dim]
        memory_length: (optional) an integer Tensor of shape [batch_size] which specify length of
                        memory (key and values) for each sample
        memory_mask: (optional) a bool Tensor of shape [batch_size, seq_length] for specifying the true elements of
                     memory in the condition that memory_length is not given
        name: (optional)

    Returns:
        a Tensor of shape [batch_size, value_dim] which is the result of attention mechanism

    """
    if name is None:
        name = "simple_dot_attention"
    with tf.name_scope(name):
        # handling exceptions
        if memory_length is not None and memory_mask is not None:
            raise AttributeError("Only one of memory_length and memory_mask can be specified")
        query_shape = tf.shape(query)
        key_shape = tf.shape(key)
        value_shape = tf.shape(value)
        batch_size = query_shape[0]
        seq_length = key_shape[1]
        query_dim = query_shape[1]
        if memory_length is not None:
            memory_mask = mask_length(memory_length, seq_length)
        if memory_mask is None:
            memory_mask = tf.fill([batch_size, seq_length], True)
        indices = tf.where(memory_mask)
        queries = tf.gather(query, indices[:, 0])
        keys = tf.boolean_mask(key, memory_mask)
        attention_logits = tf.reduce_sum(queries * keys, -1)  # dot product of each query with its key
        attention_logits = tf.scatter_nd(tf.where(memory_mask), attention_logits, [batch_size, seq_length])
        attention_logits = tf.where(memory_mask, attention_logits, tf.fill([batch_size, seq_length], -float("Inf")))
        attention_coefficients = tf.nn.softmax(attention_logits)
        attention = tf.expand_dims(attention_coefficients, -1) * value
    return tf.reduce_sum(attention, 1, name=name)


def multiple_dot_attention(query, key, value, query_length=None, query_mask=None, memory_length=None,
                           memory_mask=None, name=None):
    """
    Attention method for given queries, keys and values, which in each sample we have multiple queries.
    (a sequence of queries)

    Args:
        query: a Tensor of shape [batch_size, q_length, query_dim]
        key: a Tensor of shape [batch_size, seq_length, query_dim]
        value: a Tensor of shape [batch_size, seq_length, value_dim]
        query_length: (optional) an integer Tensor of shape [batch_size] which specify length of
                        queries for each sample
        query_mask: (optional) a bool Tensor of shape [batch_size, query_length] for specifying  the true
                    elements of queries in the condition that query_length is not given
        memory_length: (optional) an integer Tensor of shape [batch_size] which specify length of
                        memory (key and values) for each sample
        memory_mask: (optional) a bool Tensor of shape [batch_size, seq_length] for specifying the true elements of
                     keys and values in the condition that memory_length is not given
        name: (optional)

    Returns:
        a Tensor of shape [batch_size, q_length, value_dim] which is the result of attention mechanism

    """
    if name is None:
        name = "multiple_dot_attention"
    with tf.name_scope(name):
        if query_length is not None and query_mask is not None:
            raise AttributeError("Only one of query_length and query_mask can be specified")
        if memory_length is not None and memory_mask is not None:
            raise AttributeError("Only one of memory_length and memory_mask can be specified")
        query_shape = tf.shape(query)
        key_shape = tf.shape(key)
        value_shape = tf.shape(value)
        batch_size = query_shape[0]
        q_length = query_shape[1]
        seq_length = key_shape[1]
        query_dim = query_shape[2]
        value_dim = value_shape[2]
        if query_length is not None:
            query_mask = mask_length(query_length, q_length)
        if query_mask is None:
            query_mask = tf.fill([batch_size, q_length], True)
        if memory_length is not None:
            memory_mask = mask_length(memory_length, seq_length)
        if memory_mask is None:
            memory_mask = tf.fill([batch_size, seq_length], True)
        indices = tf.where(query_mask)
        query = tf.boolean_mask(query, query_mask)
        key = tf.gather(key, indices[:, 0])
        value = tf.gather(value, indices[:, 0])
        memory_mask = tf.gather(memory_mask, indices[:, 0])
        attention = simple_dot_attention(query, key, value, memory_mask=memory_mask)
    return tf.scatter_nd(indices, attention, [batch_size, q_length, value_dim], name=name)
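
To see what the masking and the attention compute on concrete numbers, here is a plain-Python sketch of the same idea for a single query. It is a simplified re-statement written for this illustration, not the TensorFlow code above: `dot_attention` is a hypothetical helper, and this `mask_length` works on Python lists instead of tensors.

```python
import math


def mask_length(lengths, max_length):
    """Boolean mask per row: True for positions smaller than that row's length."""
    return [[i < n for i in range(max_length)] for n in lengths]


def dot_attention(query, keys, values, mask):
    """Masked dot-product attention for one query vector over one sequence."""
    # Masked positions get a logit of -inf, so their softmax weight is exactly 0.
    logits = [
        sum(q * k for q, k in zip(query, key)) if keep else -math.inf
        for key, keep in zip(keys, mask)
    ]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


masks = mask_length([2, 1, 0, 1], 3)
print(masks)

# The third position is padding, so it must not influence the result.
out = dot_attention([1.0, 0.0],
                    keys=[[0.0, 1.0], [0.0, 2.0], [9.0, 9.0]],
                    values=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
                    mask=masks[0])
print(out)
```

Both unmasked keys have a dot product of 0 with the query, so they receive equal weights of 0.5 and the padded position contributes nothing. A row whose mask is entirely False is not handled by this sketch.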



In [None]:
class Linear(object):
    def __init__(self, units, activation=None, use_bias=True, kernel_initializer=None, bias_initializer=None,
                 trainable=True, dtype=tf.float32, name=None):
        """
        Linear dense layer for sequential data.

        Args:
             units: the size of output representation
             activation: (Optional) layer activation
             use_bias: (Optional)
             kernel_initializer: (Optional)
             bias_initializer: (Optional)
             trainable: (Optional)
             dtype: (Optional)
             name: (Optional)

        """
        # general init
        if name is None:
            name = "Linear"
        self.name = str(name)
        self.variable_scope = tf.variable_scope(name)
        self._variables = []
        self._trainable_variables = {}
        self.built = False

        # importing inputs
        self.units = units
        self.activation = activation
        self.use_bias = use_bias
        self.kernel_initializer = kernel_initializer
        self.bias_initializer = bias_initializer
        self._trainable = bool(trainable)
        self._dtype = dtype

        # setting attributes
        self.kernel = None
        self.bias = None
        self.inputs_dim = None

    @property
    def trainable(self):
        return self._trainable

    @trainable.setter
    def trainable(self, t):
        assert isinstance(t, bool)
        self._trainable = t
        for variable in self._trainable_variables:
            self._trainable_variables[variable] = t

    @property
    def variables(self):
        return self._variables

    @property
    def trainable_variables(self):
        return [var for var in self._variables if self._trainable_variables[var]]

    def build(self, input_shape):
        """
            Variables:
                kernel: a Tensor of shape [input_dim, self.units]
                bias: a Tensor of shape [self.units] if use_bias = True

        """
        self.inputs_dim = input_shape[-1]
        with tf.variable_scope(self.variable_scope):
            self.kernel = tf.get_variable(name="kernel", shape=[self.inputs_dim, self.units],
                                          dtype=self._dtype, initializer=self.kernel_initializer)
            self._variables.append(self.kernel)
            self._trainable_variables[self.kernel] = self.trainable
            if self.use_bias:
                self.bias = tf.get_variable(name="bias", shape=[self.units], dtype=self._dtype,
                                            initializer=self.bias_initializer)
                self._variables.append(self.bias)
                self._trainable_variables[self.bias] = self.trainable
        self.built = True

    def call(self, inputs, inputs_num=None, inputs_mask=None, name=None):
        """

        Args:
            inputs: a Tensor of shape [batch_size, seq_length, input_dim]
            inputs_num: (Optional) an integer Tensor of [batch_size] which specify length of
                        inputs for each sample
            inputs_mask: (Optional) a bool Tensor of shape [batch_size, seq_length] for specifying  the true
                         elements of inputs in the condition that inputs_num is not given
            name: (Optional)

        Returns:
            a Tensor of shape [batch_size, seq_length, self.units]

        """
        if name is None:
            name = "linear"

        with tf.name_scope(self.name):
            if inputs_num is not None and inputs_mask is not None:
                raise AttributeError("only one of inputs_num or inputs_mask should be specified")
            inputs_shape = tf.shape(inputs)
            batch_size = inputs_shape[0]
            seq_length = inputs_shape[1]
            inputs_dim = inputs_shape[2]
            # main code
            if inputs_num is not None:
                inputs_mask = mask_length(inputs_num, seq_length)
            if inputs_mask is None:
                inputs_mask = tf.fill([batch_size, seq_length], True)
            indices = tf.where(inputs_mask)
            inputs = tf.boolean_mask(inputs, inputs_mask)
            outputs = tf.matmul(inputs, self.kernel)
            if self.bias is not None:
                outputs = outputs + self.bias
            if self.activation is not None:
                outputs = self.activation(outputs)
            outputs = tf.scatter_nd(indices, outputs, [batch_size, seq_length, self.units])
        return tf.identity(outputs, name=name)

    def __call__(self, inputs, inputs_num=None, inputs_mask=None, name=None):
        """

        Args:
            inputs: a Tensor of shape [batch_size, seq_length, input_dim]
            inputs_num: (Optional) an integer Tensor of [batch_size] which specify length of
                        inputs for each sample
            inputs_mask: (Optional) a bool Tensor of shape [batch_size, seq_length] for specifying  the true
                         elements of inputs in the condition that inputs_num is not given
            name: (Optional)

        Returns:
            a Tensor of shape [batch_size, seq_length, self.units]

        """
        if not self.built:
            self.build(inputs.shape)
        return self.call(inputs, inputs_num, inputs_mask, name)


class SelfAttention(object):
    def __init__(self, units, initializer=None, trainable=True, name=None):
        """
        Self attention layer, give a sequence as input and apply bidirectional self-attention mechanism

        Args:
             units: the representation size of queries and keys
             initializer: (Optional) initializer for Query and Key map's weights
             trainable: (Optional)
             name: (Optional)
        """
        # general init
        if name is None:
            name = "self_attention"
        self.name = str(name)
        self.variable_scope = tf.variable_scope(self.name)
        self.layers = []
        self._variables = []
        self._trainable_variables = {}
        self.built = False

        # importing inputs
        self.units = units
        self.initializer = initializer
        self._trainable = bool(trainable)

        # setting attributes
        with self.variable_scope:
            self.key_layer = Linear(self.units, use_bias=False, kernel_initializer=self.initializer,
                                    trainable=self.trainable, name="key")
            self.query_layer = Linear(self.units, use_bias=False, kernel_initializer=self.initializer,
                                      trainable=self.trainable, name="query")
            self.layers.append(self.key_layer)
            self.layers.append(self.query_layer)
        self.inputs_dim = None

    @property
    def trainable(self):
        return self._trainable

    @trainable.setter
    def trainable(self, t):
        t = bool(t)
        self._trainable = t
        for variable in self._trainable_variables:
            self._trainable_variables[variable] = t
        for layer in self.layers:
            layer.trainable = t

    @property
    def variables(self):
        return self._variables + sum([layer.variables for layer in self.layers], [])  # flat list

    @property
    def trainable_variables(self):
        return [var for var in self._variables if self._trainable_variables[var]] +\
               sum([layer.trainable_variables for layer in self.layers], [])  # flat list

    def build(self, input_shape):
        self.inputs_dim = input_shape[-1]
        self.built = True

    def call(self, inputs, inputs_num=None, inputs_mask=None, name=None):
        """

        Args:
            inputs: a Tensor of shape [batch_size, seq_length, input_dim]
            inputs_num: (Optional) an integer Tensor of [batch_size] which specify length of
                        inputs for each sample
            inputs_mask: (Optional) a bool Tensor of shape [batch_size, seq_length] for specifying  the true
                         elements of inputs in the condition that inputs_num is not given
            name: (Optional)

        Returns:
            a Tensor of shape [batch_size, seq_length, self.units]

        """
        if name is None:
            name = "self_attention"

        with tf.name_scope(self.name):
            key = self.key_layer(inputs, inputs_num, inputs_mask)
            query = self.query_layer(inputs, inputs_num, inputs_mask)
            result = multiple_dot_attention(query, key, inputs, inputs_num, inputs_mask, inputs_num, inputs_mask)
        return tf.identity(result, name=name)

    def __call__(self, inputs, inputs_num=None, inputs_mask=None, name=None):
        """

        Args:
            inputs: a Tensor of shape [batch_size, seq_length, input_dim]
            inputs_num: (Optional) an integer Tensor of [batch_size] which specify length of
                        inputs for each sample
            inputs_mask: (Optional) a bool Tensor of shape [batch_size, seq_length] for specifying  the true
                         elements of inputs in the condition that inputs_num is not given
            name: (Optional)

        Returns:
            a Tensor of shape [batch_size, seq_length, self.units]

        """
        if not self.built:
            self.build(inputs.shape)
        return self.call(inputs, inputs_num, inputs_mask, name)
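
The bookkeeping these classes do by hand (collecting variables from sub-layers and propagating the trainable flag) can be summarised in a small plain-Python sketch. `ToyLayer` and `ToyContainer` are hypothetical names, and variables are represented by their names only.

```python
class ToyLayer:
    """Leaf layer that owns variables (represented here by their names)."""

    def __init__(self, variable_names, trainable=True):
        self.variables = list(variable_names)
        self.trainable = trainable

    @property
    def trainable_variables(self):
        return list(self.variables) if self.trainable else []


class ToyContainer:
    """Composite layer: exposes the variables of all sub-layers as one flat list."""

    def __init__(self, layers):
        self.layers = list(layers)

    @property
    def variables(self):
        return [v for layer in self.layers for v in layer.variables]

    @property
    def trainable_variables(self):
        return [v for layer in self.layers for v in layer.trainable_variables]

    def set_trainable(self, flag):
        # Freezing or unfreezing the container propagates to every sub-layer.
        for layer in self.layers:
            layer.trainable = bool(flag)


key = ToyLayer(["key/kernel"])
query = ToyLayer(["query/kernel"], trainable=False)
block = ToyContainer([key, query])
print(block.variables)
print(block.trainable_variables)
block.set_trainable(False)
print(block.trainable_variables)
```

The flat list is what an optimizer's **var_list** expects, so a container should concatenate the lists of its sub-layers rather than nest them.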


-----

# S.P. standard is to use tf.keras
<div style="text-align: justify">
Keras is a high-level API to build and train deep learning models. It's used for fast prototyping, advanced research, and production, with three key advantages:</div>

* _User friendly:_ Keras has a simple, consistent interface optimized for common use cases. It provides clear and actionable feedback for user errors.
* _Modular and composable:_ Keras models are made by connecting configurable building blocks together, with few restrictions.
* _Easy to extend:_ Write custom building blocks to express new ideas for research. Create new layers, loss functions, and develop state-of-the-art models.
<div style="text-align: justify">
The problem with the old-fashioned style is that it does not fully support eager execution. For instance, if you train your model in eager execution mode, you cannot easily save or load it. Keras supports both eager and session execution and lets you use its modularity features, such as automatic handling of variables and weights (so you do not have to go through all the hardships we did above). Besides, many TensorFlow utilities come from **tf.keras**; for instance, all layers in **tf.layers** come from **tf.keras.layers**. Keras also handles the session automatically, so we do not need to create or call one.</div>



In [None]:
model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation='relu'), 
                             tf.keras.layers.Dense(64, activation='relu'), 
                             tf.keras.layers.Dense(10, activation='softmax')])

model.compile(optimizer=tf.train.AdamOptimizer(0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
data = np.random.random((100, 32))
labels = np.random.random((100, 10))

model.fit(data, labels, epochs=10, batch_size=32)


### Constructing Models and Layers
<div style="text-align: justify">All models and layers (those that have variables) should be built by subclassing **tf.keras.layers.Layer** or **tf.keras.Model**. Keras models and layers automatically handle variables, weights, trainability and scopes, so in most cases it is only necessary to implement three methods:</div>
* \_\_init\_\_
* build
* call

Note that many attributes are hidden and named with the prefix "\_", e.g. **self.\_name**. Here is an implementation of the layers above using tf.keras:

In [None]:
class Linear(tf.layers.Dense):
    def __init__(self, units, activation=None, use_bias=True, kernel_initializer=None, bias_initializer=None,
                 trainable=True, name=None):
        """
        Linear dense layer for sequential data.

        Args:
             units: the size of output representation
             activation: (Optional) layer activation
             use_bias: (Optional)
             kernel_initializer: (Optional)
             bias_initializer: (Optional)
             trainable: (Optional)
             name: (Optional)

        """
        # general init
        if name is None:
            name = "Linear"
        super().__init__(units, activation, use_bias, kernel_initializer, bias_initializer,
                         trainable=trainable, name=name)
        with tf.variable_scope(self.name):
            self._set_scope()
        self._inputs_dim = None

    def build(self, inputs_shape):
        self._inputs_dim = inputs_shape[0][-1]
        super().build([1, self._inputs_dim])

    def call(self, inputs):
        """

        Args:
            inputs: whether a Tensor of shape [batch_size, seq_length, input_dim] or a tuple of tensors, first a Tensor
            of shape [batch_size, seq_length, input_dim] which is inputs and a Tensor of type tf.bool and size of
            [batch_size, seq_length] for specifying the true elements of inputs


        Returns:
            a Tensor of shape [batch_size, seq_length, self.units]

        """
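        # a bare Tensor carries no mask; otherwise unpack (inputs, inputs_mask)
        if isinstance(inputs, tf.Tensor):
            inputs_mask = None
        else:
            inputs, inputs_mask = inputs
        inputs_shape = tf.shape(inputs)
        batch_size = inputs_shape[0]
        seq_length = inputs_shape[1]
        inputs_dim = inputs_shape[2]
        if inputs_mask is None:
            # no mask given: treat every position as valid
            inputs_mask = tf.fill([batch_size, seq_length], True)
        indices = tf.where(inputs_mask)
        inputs = tf.boolean_mask(inputs, inputs_mask)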
        outputs = super().call(inputs)
        outputs = tf.scatter_nd(indices, outputs, [batch_size, seq_length, self.units])
        return outputs

    
class SelfAttention(tf.keras.layers.Layer):
    def __init__(self, units, initializer=None, trainable=True, name=None):
        """
        Self attention layer, give a sequence as input and apply bidirectional self-attention mechanism

        Args:
             units: the representation size of queries and keys
             initializer: (Optional) initializer for Query and Key map's weights
             trainable: (Optional)
             name: (Optional)
        """
        # general init
        if name is None:
            name = "self_attention"
        super().__init__(trainable, name)
        with tf.variable_scope(self.name):
            self._set_scope()
        self._units = units
        self._inputs_dim = None
        self._layers = []
        with tf.variable_scope(self.name):
            with tf.name_scope(self.name):
                self._key_dense = Linear(self._units, use_bias=False, kernel_initializer=initializer,
                                         trainable=self.trainable, name="key_dense")
                self._query_dense = Linear(self._units, use_bias=False, kernel_initializer=initializer,
                                           trainable=self.trainable, name="query_dense")
        self._layers = [self._key_dense, self._query_dense]

    @property
    def layers(self):
        return self._layers.copy()

    @property
    def variables(self):
        return sum([layer.variables for layer in self.layers], [])

    @property
    def trainable_variables(self):
        return sum([layer.trainable_variables for layer in self.layers], [])

    def call(self, inputs, **kwargs):
        """

        Args:
            inputs: whether a Tensor of shape [batch_size, seq_length, input_dim] or a tuple of tensors, first a Tensor
            of shape [batch_size, seq_length, input_dim] which is inputs and a Tensor of type tf.bool and size of
            [batch_size, seq_length] for specifying the true elements of inputs


        Returns:
            a Tensor of shape [batch_size, seq_length, self.units]

        """
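        key = self._key_dense(inputs)  # attribute names match those set in __init__
        query = self._query_dense(inputs)
        # a bare Tensor carries no mask; otherwise unpack (inputs, inputs_mask)
        if isinstance(inputs, tf.Tensor):
            inputs_mask = None
        else:
            inputs, inputs_mask = inputs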
        result = multiple_dot_attention(query, key, inputs, query_mask=inputs_mask, memory_mask=inputs_mask)
        return result
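
To make the masking in `call` more concrete, here is a minimal NumPy sketch of masked dot-product self-attention. It is not the implementation of `multiple_dot_attention`; it only assumes that the function computes a softmax over query-key dot products and uses the raw inputs as values. The names `masked_softmax`, `self_attention`, `w_query` and `w_key` are hypothetical and exist only for this illustration.

```python
import numpy as np


def masked_softmax(scores, mask):
    """Softmax over the last axis that gives zero weight where mask is False."""
    scores = np.where(mask, scores, -1e9)                  # push masked scores far down
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores) * mask                        # masked positions become exactly 0
    return weights / np.maximum(weights.sum(axis=-1, keepdims=True), 1e-9)


def self_attention(inputs, w_query, w_key, inputs_mask=None):
    """inputs: [batch, seq, dim]; w_query, w_key: [dim, units]; inputs_mask: [batch, seq] bool."""
    if inputs_mask is None:
        inputs_mask = np.ones(inputs.shape[:2], dtype=bool)
    query = inputs @ w_query                      # [batch, seq, units]
    key = inputs @ w_key                          # [batch, seq, units]
    scores = query @ key.transpose(0, 2, 1)       # [batch, seq, seq]
    # each query position may only attend to valid memory positions
    weights = masked_softmax(scores, inputs_mask[:, None, :])
    outputs = weights @ inputs                    # [batch, seq, dim]
    # padded query positions are zeroed out
    return outputs * inputs_mask[:, :, None]


rng = np.random.RandomState(0)
x = rng.randn(2, 4, 3)
w_q, w_k = rng.randn(3, 5), rng.randn(3, 5)
mask = np.array([[True, True, True, False],
                 [True, True, False, False]])
out = self_attention(x, w_q, w_k, mask)
print(out.shape)  # (2, 4, 3)
```

Because masked positions receive a weight of exactly zero, changing the padded part of a sequence does not change the outputs at the valid positions.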


Here is a bidirectional language model built with a transformer:

In [None]:
class Transformer(tf.keras.Sequential):
    def __init__(self, units, num_blocks, trainable=True, activation=tf.sigmoid, name=None):
        """
        simple transformer with self-attention and linear layers

        Args:
            units: an integer, the size of the output representation
            num_blocks: an integer, the number of (self-attention, linear) blocks

        """
        assert num_blocks >= 0
        if name is None:
            name = "Transformer"
        super().__init__(name=name)
        with tf.variable_scope(self.name):
            self._set_scope()
        with tf.variable_scope(self.name):
            with tf.name_scope(self.name):
                for i in range(num_blocks):
                    # give every block its own layer names so that scopes and variables do not collide
                    self.add(SelfAttention(units, trainable=trainable, name="self_attention_%d" % i))
                    self.add(Linear(units, activation=activation, trainable=trainable, name="dense_%d" % i))
             
            

class Model(tf.keras.Model):
    def __init__(self, vocab_size, embedding, num_blocks=3, transformer_units=100, trainable=True, name=None):
        """
        The model consists of an embedding layer, a transformer made of num_blocks blocks and a final softmax layer
        which gives the probability vector of the blank word in the sentence. The model uses a trainable vector as
        the embedding of the blank.

        Args:
            vocab_size: an integer, the number of words in the vocabulary
            embedding: the word embeddings of the vocabulary. a Tensor of shape [vocab_size, embedding_size]
            num_blocks: (Optional) an integer which specifies the number of transformer blocks
            transformer_units: (Optional) an integer, the size of the transformer's output representation.
            trainable: (Optional)
            name: (Optional)

        """
        assert num_blocks >= 0
        if name is None:
            name = "Model"
        super().__init__(name=name)
        self._vocab_size = vocab_size
        self._embedding_size = embedding.shape[1]
        self._blank_embedding = None
        with tf.variable_scope(self.name):
            self._set_scope()
        with tf.variable_scope(self.name):
            with tf.name_scope(self.name):
                embedding_initializer = tf.keras.initializers.constant(embedding)
                self._embedding = tf.keras.layers.Embedding(vocab_size, self._embedding_size,
                                                            embedding_initializer, trainable=False)
                self._transformer = Transformer(transformer_units, num_blocks,
                                                trainable=self.trainable, name="transformer")
                self._softmax = tf.layers.Dense(vocab_size, tf.nn.softmax, trainable=self.trainable)
        self._layers = self._layers + [self._embedding] + [layer for layer in self._transformer.layers] +\
                       [self._softmax]

    def build(self, input_shape):
        self._blank_embedding = self.add_weight("blank_embedding", [self._embedding_size])

    def call(self, inputs, training=None, mask=None):
        """

        Args:
            inputs: a triple of tensors (inputs, blanks, num_inputs)
                    inputs: an integer Tensor of shape [batch_size, seq_length]
                    blanks: an integer Tensor of shape [batch_size] which gives the position of the blank word in
                            each sample
                    num_inputs: an integer Tensor of shape [batch_size] which gives the sequence length of each sample

        Returns:
            a Tensor of shape [batch_size, vocab_size] which is the probability vector for blank words

        """
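        inputs, blanks, num_inputs = inputs
        inputs_shape = tf.shape(inputs)
        batch_size = inputs_shape[0]
        seq_length = inputs_shape[1]
        # (row, column) index of the blank word in every sample: [batch_size, 2]
        blanks_indices = tf.stack([tf.range(batch_size), tf.cast(blanks, tf.int32)], axis=1)
        # boolean mask of shape [batch_size, seq_length], True only at the blank position
        blank_mask = tf.cast(tf.scatter_nd(blanks_indices, tf.ones([batch_size], tf.int32),
                                           [batch_size, seq_length]), tf.bool)
        inputs_mask = mask_length(num_inputs)
        x = self._embedding(inputs)
        blanks_embedding = tf.zeros_like(x) + self._blank_embedding
        # tf.where in TF 1.x does not broadcast, so the mask is tiled to the shape of x
        blank_mask = tf.tile(tf.expand_dims(blank_mask, -1), [1, 1, tf.shape(x)[2]])
        x = tf.where(blank_mask, blanks_embedding, x)
        x = self._transformer([x, inputs_mask])
        # keep only the representation at the blank position so the output is [batch_size, vocab_size]
        x = tf.gather_nd(x, blanks_indices)
        probs = self._softmax(x)
        return probs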
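
The masking steps in `call` can be checked on a tiny example. The following NumPy sketch uses made-up values to show how the blank mask and the length mask are built, and how the blank embedding replaces the word embedding at the blank position. It assumes that `mask_length` returns a boolean mask which is True for real tokens and False for padding.

```python
import numpy as np

batch_size, seq_length, embedding_size = 2, 4, 3
# stand-in for the output of the embedding layer: [batch_size, seq_length, embedding_size]
embedded = np.arange(batch_size * seq_length * embedding_size, dtype=float).reshape(
    batch_size, seq_length, embedding_size)
blanks = np.array([1, 2])                          # position of the blank in each sample
num_inputs = np.array([4, 3])                      # length of each sample
blank_embedding = np.full(embedding_size, -1.0)    # stand-in for the trainable blank vector

# True only at the blank position of each sample: [batch_size, seq_length]
blank_mask = np.arange(seq_length)[None, :] == blanks[:, None]
# True for real tokens, False for padding (the assumed behaviour of mask_length)
inputs_mask = np.arange(seq_length)[None, :] < num_inputs[:, None]

# replace the embedding at the blank position with the blank vector
x = np.where(blank_mask[:, :, None], blank_embedding, embedded)

print(blank_mask.astype(int))
# [[0 1 0 0]
#  [0 0 1 0]]
print(inputs_mask.astype(int))
# [[1 1 1 1]
#  [1 1 1 0]]
```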



----

# Some other tips!

* Use the **tf.data** module to feed data to models. See [here](https://www.tensorflow.org/guide/datasets).



In [None]:
# `data`, `labels` and a compiled Keras `model` are assumed to be defined already.
# Instantiate a toy dataset:
dataset = tf.data.Dataset.from_tensor_slices((data, labels))
dataset = dataset.batch(16)
dataset = dataset.repeat()

# Don't forget to specify `steps_per_epoch` when calling `fit` on a dataset.
model.fit(dataset, epochs=10, steps_per_epoch=11)
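
Since the dataset above is repeated indefinitely, `fit` cannot tell where an epoch ends, which is why `steps_per_epoch` has to be given. A common choice is the number of batches needed to see every sample once. The small helper below is only a sketch, and the sample count of 170 is a hypothetical value chosen for illustration.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to go through all samples once (the last batch may be smaller)."""
    return int(math.ceil(num_samples / float(batch_size)))


num_samples = 170   # hypothetical dataset size
batch_size = 16
print(steps_per_epoch(num_samples, batch_size))  # 11
```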


* Write the different parts of the code (model, utils, pre-processing, data loading, ...) in separate Python files, and put them in separate folders if some of them are complex.
* Do not pin operations to a device in the main code; TensorFlow places operations on the available devices automatically. If you need **tf.device**, use it in a separate git branch after the model has been debugged.
* Remember to set allow_growth in the session's config when running on a shared server, so that the process does not reserve all of the GPU memory.
* Push to your GitHub repository daily, or even more often!
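
The allow_growth tip can be sketched as follows. The example assumes a TensorFlow release that ships the `tf.compat.v1` alias (on older 1.x releases, drop the `compat.v1` prefix); the helper name `make_session_config` is hypothetical.

```python
import tensorflow as tf


def make_session_config(allow_growth=True):
    """Build a session config that allocates GPU memory on demand."""
    config = tf.compat.v1.ConfigProto()
    # allocate GPU memory as it is needed instead of reserving nearly all of it up front
    config.gpu_options.allow_growth = allow_growth
    # fall back to another device if a requested device is not available
    config.allow_soft_placement = True
    return config


config = make_session_config()
# Session mode: pass the config when the session is created, e.g.
#     sess = tf.Session(config=config)
# Eager mode: pass it once, at program start, e.g.
#     tf.enable_eager_execution(config=config)
```

The config only takes effect if it is passed when the session (or eager execution) is first set up, so it is best created at the very top of the entry script.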