# Semester Project:
# Exploring Differentiable Neural Computers and Applications

## Introduction
This notebook presents a cutting-edge topic in machine learning: a neural network model called the **Differentiable Neural Computer** (DNC), which couples a neural network with an external memory that the network can read from and write to. The main source for this notebook is the article and paper available [here](https://deepmind.com/blog/differentiable-neural-computers/).
<br>

We start with a presentation of DNCs and of how they improve on *Neural Turing Machines*. We then move to the core topic of these networks: the three **differentiable attention mechanisms** with which the network performs its read and write operations. Next we propose code implementing these networks and test their performance and limitations. Finally, we will try to implement the graph task presented in the paper and adapt it to a different problem.
<hr>

## DNC
Differentiable Neural Computers are neural networks coupled with an external memory, where the size of the memory does not affect the behaviour of the network. The memory can be thought of as a RAM and the network as a controller; the whole system is differentiable, so it can be trained with gradient-descent-like techniques.
<br>
Compared with other networks that keep a memory state, such as RNNs or LSTMs, DNCs do not store information that is tightly coupled with the model itself. Instead, they can selectively read from and write to memory locations, which separates state preservation from content (data).
<br>

In the literature, an earlier network capable of reading from and writing to an external memory is the **Neural Turing Machine** (NTM). Although the concept is similar to the DNC, with a network acting as controller and an external memory, NTMs are more restricted in the methods with which the neural network accesses the memory.

<br>
![DNC](./figures/DNC_architecture.png)
<br>

The image shows the architecture of a DNC. In this specific example, the controller is a Recurrent Neural Network (an LSTM is also possible) which receives an input (from a dataset) together with the vectors read from the memory at the previous step. The output of the network is made of two parts:
- The output of the neural network, which is the prediction for our target
- A vector called the **'Interface Vector'**, which is used by the read and write heads to interact with the memory

As should be clear by now, the DNC is mainly composed of two parts interacting together: the **Neural Network** itself (**a**) and the **Heads** (**b**), which perform the operations on the **Memory** (**c**). Those operations are the key difference between the DNC and the NTM and will be described in detail in the next section.

The heads are the components that deal with the content of the memory. It is possible to define as many read and write heads as needed; each of them receives a weight vector that defines the locations on which the read or write operation is performed. The differentiability of these parts allows the network to learn how to perform these operations simply by looking at the error during the training process and adjusting the weights.

Finally, in addition to the memory, some additional information is saved and used during the read and write operations: the **Memory Usage and Temporal Links** (**d**), which are an improvement over the NTM. The usage information allows memory to be allocated dynamically during writes, and the links give a notion of temporal association between entries of the memory. That is, it is possible to know the order of the writes, which is extremely important when the network has to deal with sequential tasks such as graph path search.
<hr>


## Memory Interactions

### Overview
The core mechanism of DNCs is the ability to write to and read from an external memory matrix. As mentioned, the difference from NTMs lies in defining more **attention mechanisms**. This differs from the addressing mechanism of conventional computers, where there is a mapping between an address and the content of the memory. The idea here is to define weights over the locations of the memory. Those weights represent a degree (we can imagine it as a filter) that indicates how much each location in the memory matrix is involved in a read or write operation.
<br>
Briefly, for an $N \times W$ memory matrix $M$:
- a read vector $\mathbf{r}$ is returned by a read operation, which uses a read weighting $\mathbf{w}^r$:<br>
    $\mathbf{r}[j]=\sum_{i=1}^N M[i,j]\,\mathbf{w}^r[i]$ for $j=1, \dots,W$

- an erase vector $\mathbf{e}$ is applied using the write weighting $\mathbf{w}^w$, then a write vector $\mathbf{v}$ is added:<br>
    $M[i,j] \leftarrow M[i,j]\,(1-\mathbf{w}^w[i]\mathbf{e}[j])+\mathbf{w}^w[i]\mathbf{v}[j]$

The units that carry out these operations are the read and write heads.
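
The read and erase/write operations described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code of the DeepMind implementation: the function names `read` and `write` and the toy memory are assumptions made for this example, and the erase term is applied before the write vector is added.

```python
# Minimal sketch of the read and erase/write operations on a memory
# matrix with N rows (locations) and W columns (word size).
# The names `read` and `write` are illustrative, not from the DNC code.

def read(memory, read_weights):
    """Return r with r[j] = sum_i M[i][j] * w_r[i]."""
    n_rows, n_cols = len(memory), len(memory[0])
    return [
        sum(memory[i][j] * read_weights[i] for i in range(n_rows))
        for j in range(n_cols)
    ]


def write(memory, write_weights, erase, value):
    """Return M after M[i][j] <- M[i][j] * (1 - w[i] e[j]) + w[i] v[j]."""
    n_rows, n_cols = len(memory), len(memory[0])
    return [
        [
            memory[i][j] * (1.0 - write_weights[i] * erase[j])
            + write_weights[i] * value[j]
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]


memory = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # N = 3, W = 2

# A one-hot weighting behaves like a conventional address lookup.
hard_read = read(memory, [0.0, 1.0, 0.0])

# A soft weighting blends the content of several locations.
soft_read = read(memory, [0.5, 0.5, 0.0])

# Fully erase and overwrite location 2; the other rows are untouched.
updated = write(memory, [0.0, 0.0, 1.0], erase=[1.0, 1.0], value=[9.0, 8.0])

print(hard_read)
print(soft_read)
print(updated)
```

With a one-hot weighting the read returns exactly one stored row, while fractional weights return a mixture; this is what makes the access differentiable with respect to the weights.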

### Differentiable Attention
There are three forms of differentiable attention. This is mostly a fancy name for the three principal techniques used to address the memory. Before going into the details, we recall the techniques used in NTMs in order to appreciate the differences with the DNC.

There are two mechanisms for addressing the memory, and Neural Turing Machines combine both:
- *content-based addressing*: focuses attention on locations based on the similarity between their current values and the values emitted by the controller. This is related to the content addressing of Hopfield networks: the controller only needs to generate an approximation of a stored value to retrieve the exact location.
- *location-based addressing*: the traditional approach to addressing memory. For arithmetic operations, where we need to define variables, content-based addressing is not enough: we need the location of the variable to perform the operation.

Differentiable Neural Computers use *content-based addressing* paired with two other techniques, which together allow **Dynamic Memory Addressing** and counteract the major drawbacks of the NTM:
1. NTMs cannot prevent blocks of allocated memory from overlapping and interfering. DNCs avoid this problem because the allocation mechanism provides a single free location at each write.
2. NTMs cannot free memory locations that are no longer used, which can be a problem when processing long sequences. DNCs instead can free memory locations based on the usage weights.
3. NTMs preserve sequential information only if the content is written in consecutive locations. DNCs use an additional temporal link matrix, removing the restriction to contiguous locations.

For now we explain the concepts behind these three forms of differentiable attention. A more detailed analysis will be given in the following paragraphs.

#### Content-based addressing
This form computes a similarity measure between the vector emitted by the controller (the key) and the content of the memory. The measure is a cosine similarity that yields weights, which are then used by the read heads for associative recall or by the write head to modify the content of the memory. In addition, a key that only matches part of the content of a memory location is still useful and can lead to the retrieval of that location. This matters because the key may lack some of the information that is stored in the memory.
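
As a rough illustration of the idea, the sketch below scores each memory row against a key with cosine similarity and turns the scores into weights. The softmax normalisation and the `strength` parameter are assumptions borrowed from our reading of the DNC paper (they are not described in the text above), and all names are chosen for this example only.

```python
import math


def cosine_similarity(u, v, eps=1e-6):
    """Cosine of the angle between u and v; eps guards against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def content_weights(memory, lookup_key, strength=1.0):
    """Softmax over the similarity between the key and each memory row.

    `strength` (an assumed parameter) sharpens the focus: larger values
    push the weights towards the best matching location.
    """
    scores = [strength * cosine_similarity(row, lookup_key) for row in memory]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


memory = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]
lookup_key = [0.0, 1.0, 0.0]  # matches row 1 exactly and row 2 only in part

weights = content_weights(memory, lookup_key, strength=10.0)
best = max(range(len(weights)), key=lambda i: weights[i])

print(weights)
print(best)
```

Row 1 receives most of the weight, the partially matching row 2 still receives more than the unrelated row 0, and the weights sum to one, so they can be used directly as a read weighting.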

#### Temporal links
This form keeps track of the transitions between locations that were written consecutively, using an $N \times N$ temporal *link matrix* $L$. This matrix associates a weight between $0$ and $1$ with each pair of locations in the memory, where the entry $L[i,j]$ is the temporal relation between locations $i$ and $j$. The weight is close to $1$ if location $i$ was written right after $j$, and close to $0$ otherwise. This gives the neural network the ability to recover sequences in the order in which they were written.
<br>

The product $L\mathbf{w}$ has a smoothing effect, shifting the focus forward to the locations written after those emphasized by $\mathbf{w}$.
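
To make the effect of $L\mathbf{w}$ concrete, here is a small sketch with a hand-built link matrix. The write order (0, then 2, then 1) and the helper names are assumptions for this example; the transposed product, which moves the focus backwards, is taken from our reading of the DNC paper and is not discussed in the text above.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product for lists of lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


# Hand-built link matrix for a memory with 4 locations written in the
# order 0 -> 2 -> 1 (location 3 is never written).
# link[i][j] = 1 means "location i was written right after location j".
link = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],  # 1 was written after 2
    [1.0, 0.0, 0.0, 0.0],  # 2 was written after 0
    [0.0, 0.0, 0.0, 0.0],
]

w = [0.0, 0.0, 1.0, 0.0]  # current focus: location 2

forward = matvec(link, w)              # focus moves to what came after 2
backward = matvec(transpose(link), w)  # focus moves to what came before 2

print(forward)
print(backward)
```

Starting from location 2, the forward product lands on location 1 and the backward product on location 0, which is exactly the order in which the locations were written.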

#### Usage
This form is used to allocate memory for writing. The usage of each location is a value between $0$ and $1$, and a weighting that selects unused locations is delivered to the write head. The usage is increased after each write to a location and decreased after each read. A useful property is that this mechanism is independent of the memory size and content, which allows the network to be trained with one memory and then upgraded to a larger one.
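
One possible way to turn usage values into a weighting over free locations is sketched below. It follows the allocation rule of the DNC paper as we understand it (locations are visited from least to most used); the function name and the toy usage vector are assumptions for this example.

```python
def allocation_weights(usage):
    """Weighting that favours the least-used locations.

    Locations are visited from least to most used. Each one receives
    (1 - usage) multiplied by the product of the usages of all the
    locations visited before it.
    """
    order = sorted(range(len(usage)), key=lambda i: usage[i])
    weights = [0.0] * len(usage)
    used_product = 1.0
    for index in order:
        weights[index] = (1.0 - usage[index]) * used_product
        used_product *= usage[index]
    return weights


usage = [1.0, 0.0, 0.5, 1.0]  # location 1 is completely free
alloc = allocation_weights(usage)

print(alloc)
```

Here the whole weighting goes to the free location 1. The function only depends on the usage values, not on the number of locations or on what is stored in them, which mirrors the independence from memory size and content mentioned above.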

<hr>

## Details

<hr>

## Implementation
The code has been taken from [GitHub](https://github.com/deepmind/dnc). The idea is to analyse it and then modify or extend parts of the implementation to achieve our goals. For now we just list the major modules used to implement the DNC. We will then use the *Repeat Copy Task* to perform some tests on the DNC. The final goal, from the implementation point of view, is to extend this code with the *Graph Task*.
<br>
The modules provided are the following:
- **Addressing and Access Modules**
    This code implements the addressing methods discussed above. In addition, another script provides the functions needed to manipulate the memory using those addressing methods.
- **DNC Module**
    This code is the core of the Differentiable Neural Computer. It is an extension of the RNN core class which also includes the interaction with the memory. Recall that the controller is implemented as an LSTM recurrent network.
- **Repeat Copy Task**
    This code creates the dataset for the copy task. The DNC receives as input batches of bit strings, each of which has to be copied a randomly chosen number of times.
- **Training Module**
    Finally, the code for the training is provided. It also allows saving the variables of the model as checkpoints and restoring them for evaluation on a given input. Note that the variables are saved, not the model: the graph must be recreated and the variables then loaded within a session.

<br>
Here follows the code license:

```python
# ==========================================================================
# Copyright 2017 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==========================================================================
```

### Tests
To perform some tests on the model we had to modify the way the dataset is generated. The first test aims to understand how sensitive the model is to noisy data. We took the original generation code and applied some modifications. For random batches (the randomness is explained further on) we flipped the bits of the string to copy, altering the target. That is, to keep the DNC from learning its original task of copying a string of bits several times, we changed the target: we let the DNC believe its output was wrong and check whether, for some samples, this prevents the error from converging to 0.
<br>
Before analysing the noised version we have to test the DNC on a normal copy task with unaltered target values. We can use the modified code to create a training environment that supports both clean and noised training.
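
The target perturbation described above can be illustrated without TensorFlow. The sketch below is a simplified stand-in and not the code used in the experiments: the helper names are hypothetical, and it assumes that a sample is noised when an integer drawn in [1, 99] exceeds the noise level, and that the distortion is the number of flipped bits.

```python
import random


def flip_bits(bits):
    """Return a copy of a 0/1 sequence with every bit inverted."""
    return [1 - b for b in bits]


def distortion(original, noised):
    """Number of changed bits, computed as a sum of squared differences."""
    return sum((n - o) ** 2 for o, n in zip(original, noised))


def noise_batch(targets, noise_level, rng):
    """Flip the whole target of each sample whose draw exceeds noise_level.

    The draw is an integer in [1, 99]; noise_level=None disables the noise.
    """
    noised = []
    for target in targets:
        draw = rng.randint(1, 99)
        if noise_level is not None and draw > noise_level:
            noised.append(flip_bits(target))
        else:
            noised.append(list(target))
    return noised


rng = random.Random(0)  # fixed seed, only to make the example repeatable
targets = [[rng.randint(0, 1) for _ in range(8)] for _ in range(16)]

noised = noise_batch(targets, noise_level=85, rng=rng)
distortions = [distortion(t, n) for t, n in zip(targets, noised)]

# Each entry is either 0 (target untouched) or 8 (target fully flipped).
print(distortions)
```

With a noise level of 85 only a small fraction of the samples is expected to be altered, so most targets still describe the true copy task while a few contradict it.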

#### Repeat_copy_noise.py
Here we show our modified **repeat_copy.py** script for running the noised test. Apart from some minor changes to the interface, the added parts of the code are marked with *### MODIFIED ###* comments.

In [4]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import collections
import tensorflow as tf
import numpy as np
import random
import argparse

In [53]:
from repeat_copy import RepeatCopy

DatasetNoisedTensors = collections.namedtuple('DatasetNoisedTensors', ('observations',
                                                                       'target',
                                                                       'target_noise',
                                                                       'mask',
                                                                       'distortion'))


def custom_bit_string_readable(data, batch_size, model_output=None, whole_batch=False,
                               with_noise=False, with_distortion=False):
    """Produce a human readable representation of the sequences in data.

    Args:
      data: data to be visualised
      batch_size: size of batch
      model_output: optional model output tensor to visualize alongside data.
      whole_batch: whether to visualise the whole batch. If False, only one
          randomly chosen sample of the batch is visualised
      with_noise: whether to also visualise the noised target
      with_distortion: whether to print the distortion value, computed as the
          number of bits that differ between the real target and the noised one

    Returns:
      A string used to visualise the data batch
    """

    def _readable(datum):
        return '+' + ' '.join(['-' if x == 0 else '%d' % x for x in datum]) + '+'

    obs_batch = data.observations
    targ_batch = data.target
    targ_noise_batch = data.target_noise
    dist_batch = data.distortion
    iterate_over = range(batch_size) if whole_batch else [random.randint(0, batch_size - 1)]

    batch_strings = []
    for batch_index in iterate_over:
        obs = obs_batch[:, batch_index, :]
        targ = targ_batch[:, batch_index, :]
        targ_n = targ_noise_batch[:, batch_index, :]
        dist = dist_batch[batch_index]

        obs_channels = range(obs.shape[1])
        targ_channels = range(targ.shape[1])
        obs_channel_strings = [_readable(obs[:, i]) for i in obs_channels]
        targ_channel_strings = [_readable(targ[:, i]) for i in targ_channels]
        targ_noise_channel_strings = [_readable(targ_n[:, i]) for i in targ_channels]

        readable_obs = 'Observations:\n' + '\n'.join(obs_channel_strings)
        readable_targ = 'Targs:\n' + '\n'.join(targ_channel_strings)
        readable_targ_n = 'Target_noise: \n' + '\n'.join(targ_noise_channel_strings)
        readable_dist = 'Distortion {}\n'.format(dist)

        if with_noise and with_distortion:
            strings = [readable_obs, readable_targ, readable_targ_n, readable_dist]
        elif with_distortion and not with_noise:
            strings = [readable_obs, readable_targ, readable_dist]
        elif with_noise and not with_distortion:
            strings = [readable_obs, readable_targ, readable_targ_n]
        else:
            strings = [readable_obs, readable_targ]

        if model_output is not None:
            output = model_output[:, batch_index, :]
            output_strings = [_readable(output[:, i]) for i in targ_channels]
            strings.append('Model Output:\n' + '\n'.join(output_strings))

        batch_strings.append('\n\n'.join(strings))

    return '\n' + '\n\n\n\n'.join(batch_strings)


class NoisedRepeatCopy(RepeatCopy):

    def __init__(self, num_bits=6, batch_size=1, min_length=1, max_length=1, min_repeats=1, max_repeats=2, 
                 norm_max=10, log_prob_in_bits=False, time_average_cost=False, 
                 name='repeat_copy', noise_level=None):
        super(NoisedRepeatCopy, self).__init__(num_bits, batch_size, min_length, max_length, min_repeats, 
                                               max_repeats, norm_max, log_prob_in_bits, time_average_cost, name)

        self.noise_level = noise_level

    def _build(self):
        min_length, max_length = self._min_length, self._max_length
        min_reps, max_reps = self._min_repeats, self._max_repeats
        num_bits = self.num_bits
        batch_size = self.batch_size

        full_obs_size = num_bits + 2
        full_targ_size = num_bits + 1
        start_end_flag_idx = full_obs_size - 2
        num_repeats_channel_idx = full_obs_size - 1

        sub_seq_length_batch = tf.random_uniform(
            [batch_size], minval=min_length, maxval=max_length + 1, dtype=tf.int32)
        num_repeats_batch = tf.random_uniform(
            [batch_size], minval=min_reps, maxval=max_reps + 1, dtype=tf.int32)

        total_length_batch = sub_seq_length_batch * (num_repeats_batch + 1) + 3
        max_length_batch = tf.reduce_max(total_length_batch)
        residual_length_batch = max_length_batch - total_length_batch

        obs_batch_shape = [max_length_batch, batch_size, full_obs_size]
        targ_batch_shape = [max_length_batch, batch_size, full_targ_size]
        mask_batch_trans_shape = [batch_size, max_length_batch]

        obs_tensors = []
        targ_tensors = []
        # include also the noised version
        targ_noise_tensors = []
        mask_tensors = []
        # distortion
        distortion = []

        ### MODIFIED FROM HERE ######

        for batch_index in range(batch_size):
            sub_seq_len = sub_seq_length_batch[batch_index]
            num_reps = num_repeats_batch[batch_index]

            obs_pattern_shape = [sub_seq_len, num_bits]
            obs_pattern = tf.cast(
                tf.random_uniform(
                    obs_pattern_shape, minval=0, maxval=2, dtype=tf.int32),
                tf.float32)

            targ_pattern_shape = [sub_seq_len * num_reps, num_bits]
            flat_obs_pattern = tf.reshape(obs_pattern, [-1])
            flat_targ_pattern = tf.tile(flat_obs_pattern, tf.stack([num_reps]))

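            # Perturbation of the target: flip a bit (1 -> 0, 0 -> 1).
            def addNoise(x):
                # NOTE: this is a Python-level draw in [1, 99]. tf.map_fn traces
                # addNoise once while the graph is built, so the flip/keep
                # decision is taken once per batch element, not once per bit.
                val = np.random.randint(1, 100)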

                if (self.noise_level is not None) and (val > self.noise_level):
                    #print('Noise applied!', val)

                    def f1():
                        return tf.add(x, 1)

                    def f2():
                        return tf.add(x, -1)

                    output = tf.cond(tf.equal(x, tf.constant(1, dtype=tf.float32)), true_fn=f2, false_fn=f1)
                    return output   
                else:
                    return x

            # create a noised version of the target
            noised_flat_targ_pattern = tf.map_fn(addNoise, flat_targ_pattern)

            ### EVALUATE THE DISTORTION
            # compute how many bits are changed as a distortion metric
            distortion.append(
                tf.reduce_sum(
                    tf.square(
                        tf.subtract(
                            noised_flat_targ_pattern, flat_targ_pattern
                        )
                    )
                )
            )

            # reshape the tensors (both the original and the noised one)
            targ_pattern = tf.reshape(flat_targ_pattern, targ_pattern_shape)
            target_noise_pattern = tf.reshape(noised_flat_targ_pattern, targ_pattern_shape)

            ##### END MODIFICATIONS

            obs_flag_channel_pad = tf.zeros([sub_seq_len, 2])
            obs_start_flag = tf.one_hot(
                [start_end_flag_idx], full_obs_size, on_value=1., off_value=0.)
            num_reps_flag = tf.one_hot(
                [num_repeats_channel_idx],
                full_obs_size,
                on_value=self._normalise(tf.cast(num_reps, tf.float32)),
                off_value=0.)

            # note the concatenation dimensions.
            obs = tf.concat([obs_pattern, obs_flag_channel_pad], 1)
            obs = tf.concat([obs_start_flag, obs], 0)
            obs = tf.concat([obs, num_reps_flag], 0)

            # Now do the same for the targ_pattern (it only has one extra channel).
            targ_flag_channel_pad = tf.zeros([sub_seq_len * num_reps, 1])
            targ_end_flag = tf.one_hot(
                [start_end_flag_idx], full_targ_size, on_value=1., off_value=0.)
            targ = tf.concat([targ_pattern, targ_flag_channel_pad], 1)
            targ = tf.concat([targ, targ_end_flag], 0)

            ### INCLUDE THE NOISED ONE
            targ_noise = tf.concat([target_noise_pattern, targ_flag_channel_pad], 1)
            targ_noise = tf.concat([targ_noise, targ_end_flag], 0)
            ###

            # This aligns them s.t. the target begins as soon as the obs ends.
            obs_end_pad = tf.zeros([sub_seq_len * num_reps + 1, full_obs_size])
            targ_start_pad = tf.zeros([sub_seq_len + 2, full_targ_size])

            # The mask is zero during the obs and one during the targ.
            mask_off = tf.zeros([sub_seq_len + 2])
            mask_on = tf.ones([sub_seq_len * num_reps + 1])

            obs = tf.concat([obs, obs_end_pad], 0)
            targ = tf.concat([targ_start_pad, targ], 0)

            ### INCLUDE THE NOISED ONE
            targ_noise = tf.concat([targ_start_pad, targ_noise], 0)
            ###

            mask = tf.concat([mask_off, mask_on], 0)

            obs_tensors.append(obs)
            targ_tensors.append(targ)

            ### INCLUDE THE NOISED ONE
            targ_noise_tensors.append(targ_noise)
            ###

            mask_tensors.append(mask)

        # End the loop over batch index.
        # Compute how much zero padding is needed to make tensors sequences
        # the same length for all batch elements.
        residual_obs_pad = [
            tf.zeros([residual_length_batch[i], full_obs_size])
            for i in range(batch_size)
        ]
        residual_targ_pad = [
            tf.zeros([residual_length_batch[i], full_targ_size])
            for i in range(batch_size)
        ]
        residual_mask_pad = [
            tf.zeros([residual_length_batch[i]]) for i in range(batch_size)
        ]

        # Concatenate the pad to each batch element.
        obs_tensors = [
            tf.concat([o, p], 0) for o, p in zip(obs_tensors, residual_obs_pad)
        ]
        targ_tensors = [
            tf.concat([t, p], 0) for t, p in zip(targ_tensors, residual_targ_pad)
        ]

        ### INCLUDE THE NOISED ONE
        targ_noise_tensors = [
            tf.concat([t, p], 0) for t, p in zip(targ_noise_tensors, residual_targ_pad)
        ]
        ###

        mask_tensors = [
            tf.concat([m, p], 0) for m, p in zip(mask_tensors, residual_mask_pad)
        ]

        # Concatenate each batch element into a single tensor.
        obs = tf.reshape(tf.concat(obs_tensors, 1), obs_batch_shape)
        targ = tf.reshape(tf.concat(targ_tensors, 1), targ_batch_shape)

        ### INCLUDE THE NOISED ONE
        targ_noise = tf.reshape(tf.concat(targ_noise_tensors, 1), targ_batch_shape)
        ###

        mask = tf.transpose(
            tf.reshape(tf.concat(mask_tensors, 0), mask_batch_trans_shape))
        # return the collection including the noised one
        return DatasetNoisedTensors(obs, targ, targ_noise, mask, distortion)

    def to_human_readable(self, data, model_output=None, whole_batch=False, with_distortion=False, 
                          with_noise=False):
        obs = data.observations
        # the channel holding the number of repetitions has to be denormalised,
        # which is why the last channel is selected
        unnormalised_num_reps_flag = self._unnormalise(obs[:, :, -1:]).round()
        # rebuild the observations with the unnormalised values
        obs = np.concatenate([obs[:, :, :-1], unnormalised_num_reps_flag], axis=2)
        data = data._replace(observations=obs)
        return custom_bit_string_readable(data, self.batch_size, model_output, whole_batch, with_noise, 
                                          with_distortion)

#### train_noise.py
Here we show our modified version of **train.py** which exploits the modified version of repeat_copy.py to perform the training task.

In [54]:
# Model parameters
HIDDEN_SIZE = 64
MEMORY_SIZE = 16
WORD_SIZE = 16
NUM_WRITE_HEADS = 1
NUM_READ_HEADS = 4
CLIP_VALUE = 20

# Optimizer parameters.
MAX_GRAD_NORM = 50
LEARN_RATE = 1e-4
OPTIMIZER_EPSY = 1e-10

# Task parameters
BATCH_SIZE = 16
NUMB_BITS = 4
MIN_LENGTH = 1
MAX_LENGTH = 2
MIN_REPEATS = 1
MAX_REPEATS = 2

# Training options.
TRAINING_ITER = 1000
REPORT_INTERV = 100
CHECHK_POINT_DIR = "/tmp/tf/dnc"
CHECK_POINT_INTERVAL = -1

In [60]:
from dnc import DNC

class Training():
    
    def __init__(self, hidden_size, memory_size, word_size, num_write_heads, num_read_heads, clip_value, 
                max_grad_norm, learning_rate, optimizier_epsy, batch_size, numb_bits, min_length, max_length, 
                min_repeats, max_repeats):
        
        self._hidden_size = hidden_size 
        self._memory_size = memory_size
        self._word_size = word_size
        self._num_write_heads = num_write_heads
        self._num_read_heads = num_read_heads
        self._clip_value = clip_value

        self._max_grad_norm = max_grad_norm
        self._learning_rate = learning_rate
        self._optimizier_epsy = optimizier_epsy

        self._batch_size = batch_size
        self._numb_bits = numb_bits
        self._min_length = min_length
        self._max_length = max_length
        self._min_repeats = min_repeats
        self._max_repeats = max_repeats
  

    def run_model(self, input_sequence, output_size):
        """Runs model on input sequence."""

        access_config = {
            "memory_size": self._memory_size,
            "word_size": self._word_size,
            "num_reads": self._num_read_heads,
            "num_writes": self._num_write_heads,
        }
        controller_config = {
            "hidden_size": self._hidden_size,
        }
        clip_value = self._clip_value

        dnc_core = DNC(access_config, controller_config, output_size, clip_value)
        initial_state = dnc_core.initial_state(self._batch_size)

        output_sequence, _ = tf.nn.dynamic_rnn(
            # instance of a RNN core module
            cell=dnc_core,
            inputs=input_sequence,
            time_major=True,
            initial_state=initial_state)

        return output_sequence


    def train(self, num_training_iterations, report_interval, checkpoint_dir='/tmp/tf/dnc', checkpoint_interval=-1, 
              with_distorsion=False, with_noise=False, noise_level=None, verbosity=0):
        """Trains the DNC and periodically reports the loss."""


        #  dataset definition
        dataset_noise = NoisedRepeatCopy(self._numb_bits, self._batch_size,
                                                           self._min_length, self._max_length,
                                                           self._min_repeats, self._max_repeats,
                                                           noise_level=noise_level)

        dataset_tensors = dataset_noise()

        #  compute the output of the RNN
        output_logits = self.run_model(dataset_tensors.observations, dataset_noise.target_size)
        # Used for visualization
        output = tf.round(
            tf.expand_dims(dataset_tensors.mask, -1) * tf.sigmoid(output_logits))

        #  compute the loss with the respect of the target
        train_loss = dataset_noise.cost(output_logits, dataset_tensors.target_noise,
                                        dataset_tensors.mask)

        # Set up optimizer with global norm clipping.
        trainable_variables = tf.trainable_variables()

        #  compute the gradient
        grads, _ = tf.clip_by_global_norm(
            tf.gradients(train_loss, trainable_variables), self._max_grad_norm)

        # define a global variable
        global_step = tf.get_variable(
            name="global_step",
            shape=[],
            dtype=tf.int64,
            initializer=tf.zeros_initializer(),
            trainable=False,
            collections=[tf.GraphKeys.GLOBAL_VARIABLES, tf.GraphKeys.GLOBAL_STEP])

        #  RMSProp optimizer used to apply the clipped gradients
        optimizer = tf.train.RMSPropOptimizer(
            self._learning_rate, epsilon=self._optimizier_epsy)
        # global_step is incremented by one each time the gradients are applied
        train_step = optimizer.apply_gradients(
            zip(grads, trainable_variables), global_step=global_step)

        saver = tf.train.Saver()

        if checkpoint_interval > 0:
            hooks = [
                tf.train.CheckpointSaverHook(
                    checkpoint_dir=checkpoint_dir,
                    save_steps=checkpoint_interval,
                    saver=saver)
            ]
        else:
            hooks = []

        with tf.train.SingularMonitoredSession(
                hooks=hooks, checkpoint_dir=checkpoint_dir) as sess:

            start_iteration = sess.run(global_step)
            tf.logging.info(" - Training has started!")

            results = {
                'noise_level': noise_level,
                'iteration': [],
                'outputs': [],
                'data': [],
                'strings': [],
                'losses': []
            }

            total_loss = 0
            for train_iteration in range(start_iteration, num_training_iterations):
                _, loss = sess.run([train_step, train_loss])
                total_loss += loss

                if (train_iteration + 1) % report_interval == 0:
                    dataset_tensors_np, output_np = sess.run([dataset_tensors, output])
                    dataset_string = dataset_noise.to_human_readable(dataset_tensors_np, output_np,
                                                                     with_distortion=with_distorsion,
                                                                     with_noise=with_noise)

                    results['iteration'].append(train_iteration)
                    results['data'].append(dataset_tensors_np)
                    results['strings'].append(dataset_string)
                    results['outputs'].append(output_np)
                    results['losses'].append(total_loss / report_interval)

                    if verbosity == 0:
                        tf.logging.info(" - Iter: %d: Avg training loss %f.",
                                        train_iteration, total_loss / report_interval)
                    else:
                        tf.logging.info(" - Iter: %d: Avg training loss %f.\n%s",
                                        train_iteration, total_loss / report_interval,
                                        dataset_string)
                    total_loss = 0

        return results

In [56]:
NOISE_LEVEL = None
TRAINING_ITER = 100000
REPORT_INTERV = 5000

training = Training(HIDDEN_SIZE, MEMORY_SIZE, WORD_SIZE, NUM_WRITE_HEADS, NUM_READ_HEADS, 
                   CLIP_VALUE, MAX_GRAD_NORM, LEARN_RATE, OPTIMIZER_EPSY, BATCH_SIZE, 
                   NUMB_BITS, MIN_LENGTH, MAX_LENGTH, MIN_REPEATS, MAX_REPEATS)

tf.reset_default_graph()
tf.logging.set_verbosity(3)  # Print INFO log messages.
results_clean = training.train(TRAINING_ITER, REPORT_INTERV, with_noise=False, with_distorsion=False, 
                               noise_level=NOISE_LEVEL)

Tensor("rnn/TensorArrayStack/TensorArrayGatherV3:0", shape=(?, 16, 5), dtype=float32)
Tensor("repeat_copy/Reshape_50:0", shape=(?, 16, 5), dtype=float32)
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow: - Training has started!
INFO:tensorflow: - Iter: 4999: Avg training loss 7.938283.
INFO:tensorflow: - Iter: 9999: Avg training loss 4.396687.
INFO:tensorflow: - Iter: 14999: Avg training loss 2.364800.
INFO:tensorflow: - Iter: 19999: Avg training loss 1.378604.
INFO:tensorflow: - Iter: 24999: Avg training loss 0.773323.
INFO:tensorflow: - Iter: 29999: Avg training loss 0.333952.
INFO:tensorflow: - Iter: 34999: Avg training loss 0.185517.
INFO:tensorflow: - Iter: 39999: Avg training loss 0.127265.
INFO:tensorflow: - Iter: 44999: Avg training loss 0.096554.
INFO:tensorflow: - Iter: 49999: Avg training loss 0.080684.
INFO:tensorflow: - Iter: 54999: Avg training loss 0.060898.
INFO:tensorflow: - Iter: 59

In [61]:
NOISE_LEVEL = 85.
TRAINING_ITER = 100000
REPORT_INTERV = 5000

training = Training(HIDDEN_SIZE, MEMORY_SIZE, WORD_SIZE, NUM_WRITE_HEADS, NUM_READ_HEADS, 
                   CLIP_VALUE, MAX_GRAD_NORM, LEARN_RATE, OPTIMIZER_EPSY, BATCH_SIZE, 
                   NUMB_BITS, MIN_LENGTH, MAX_LENGTH, MIN_REPEATS, MAX_REPEATS)

tf.reset_default_graph()
tf.logging.set_verbosity(tf.logging.INFO)  # Print INFO log messages.
results_noise = training.train(TRAINING_ITER, REPORT_INTERV, with_noise=False, with_distorsion=False, 
                               noise_level=NOISE_LEVEL)

Tensor("rnn/TensorArrayStack/TensorArrayGatherV3:0", shape=(?, 16, 5), dtype=float32)
Tensor("repeat_copy/Reshape_50:0", shape=(?, 16, 5), dtype=float32)
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow: - Training has started!
INFO:tensorflow: - Iter: 4999: Avg training loss 8.309622.
INFO:tensorflow: - Iter: 9999: Avg training loss 5.732491.
INFO:tensorflow: - Iter: 14999: Avg training loss 3.877566.
INFO:tensorflow: - Iter: 19999: Avg training loss 3.107155.
INFO:tensorflow: - Iter: 24999: Avg training loss 2.822140.
INFO:tensorflow: - Iter: 29999: Avg training loss 2.637930.
INFO:tensorflow: - Iter: 34999: Avg training loss 2.483084.
INFO:tensorflow: - Iter: 39999: Avg training loss 2.352814.
INFO:tensorflow: - Iter: 44999: Avg training loss 2.306345.
INFO:tensorflow: - Iter: 49999: Avg training loss 2.259139.
INFO:tensorflow: - Iter: 54999: Avg training loss 2.233495.
INFO:tensorflow: - Iter: 59

This *matrix* shows the number of changed bits. It should be read as follows: there is one row per saved iteration and one column per batch element. Note that each iteration generates a batch of 16 sequences; in this case only one of them has been affected by noise.

In [84]:
for indx in range(len(results_noise['data'])):
    print(results_noise['data'][indx][4])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 16.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
[0.0, 0.0, 0.0, 0.0, 0.
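
As an illustration of how per-element counts like the ones above can be obtained, here is a minimal sketch that compares a clean batch with a noisy copy of it and counts the differing bits. It assumes the noise flips individual bits of a binary pattern; `count_changed_bits` and the toy batch are hypothetical and are not taken from `dataset_noise`, whose actual implementation may differ.

```python
import copy


def count_changed_bits(clean_batch, noisy_batch):
    """Return, for each batch element, how many bits differ between the two versions.

    Hypothetical helper: each batch element is a list of rows, each row a list of 0/1 bits.
    """
    counts = []
    for clean_seq, noisy_seq in zip(clean_batch, noisy_batch):
        changed = sum(
            c != n
            for clean_row, noisy_row in zip(clean_seq, noisy_seq)
            for c, n in zip(clean_row, noisy_row)
        )
        counts.append(float(changed))
    return counts


# Toy batch of 3 sequences, each with 2 rows of 4 bits (illustrative sizes only).
clean = [[[1, 0, 1, 0], [0, 1, 1, 0]] for _ in range(3)]
noisy = copy.deepcopy(clean)

# Flip four bits in the second batch element only.
for row, col in [(0, 0), (0, 3), (1, 1), (1, 2)]:
    noisy[1][row][col] = 1 - noisy[1][row][col]

counts = count_changed_bits(clean, noisy)
print(counts)  # one value per batch element, non-zero only where noise was applied
```

As in the matrix above, only the column of the element that received noise is non-zero.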

Here, instead, we show the result for each saved iteration. Each report contains the observation, which is the input of the DNC, and the model output alongside the target. Note that the last two rows of the observation are the termination channel and the repetition channel, respectively, while the last row of the model output is the termination channel. These channels are not part of the array to be copied; they are extra information passed to the DNC controller.

In [83]:
for string, iteration, loss in zip(results_noise['strings'], 
                                   results_noise['iteration'],
                                   results_noise['losses']):
    print('>>>> Iteration: ' + str(iteration))
    print(string)
    print('>>>> Losses: '+ str(loss))
    

>>>> Iteration: 4999

Observations:
+- 1 - - - - - - -+
+- 1 - - - - - - -+
+- 1 - - - - - - -+
+- - - - - - - - -+
+1 - - - - - - - -+
+- - 2 - - - - - -+

Targs:
+- - - 1 1 - - - -+
+- - - 1 1 - - - -+
+- - - 1 1 - - - -+
+- - - - - - - - -+
+- - - - - 1 - - -+

Model Output:
+- - - 1 - - - - -+
+- - - 1 - - - - -+
+- - - 1 1 - - - -+
+- - - - - - - - -+
+- - - - - 1 - - -+
>>>> Losses: 8.309621811771393
>>>> Iteration: 9999

Observations:
+- 1 - - - - - - -+
+- 1 - - - - - - -+
+- - - - - - - - -+
+- 1 - - - - - - -+
+1 - - - - - - - -+
+- - 1 - - - - - -+

Targs:
+- - - 1 - - - - -+
+- - - 1 - - - - -+
+- - - - - - - - -+
+- - - 1 - - - - -+
+- - - - 1 - - - -+

Model Output:
+- - - 1 - - - - -+
+- - - 1 - - - - -+
+- - - - - - - - -+
+- - - 1 - - - - -+
+- - - - 1 - - - -+
>>>> Losses: 5.732491242980957
>>>> Iteration: 14999

Observations:
+- 1 1 - - - - - -+
+- - - - - - - - -+
+- 1 - - - - - - -+
+- 1 - - - - - - -+
+1 - - - - - - - -+
+- - - 1 - - - - -+

Targs:
+- - - - 1 1 - 
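
The row layout described above (data bits followed by two control channels) can be made concrete with a small sketch. The observation printed for iteration 9999 is re-encoded by hand as lists of integers, with `-` read as 0; `split_observation` is a hypothetical helper written for this illustration and is not part of the notebook's code.

```python
def split_observation(rows):
    """Separate the data bits from the two trailing control channels.

    Assumes, as described in the text, that the last two rows are the
    termination channel and the repetition channel, in that order.
    """
    data_rows = rows[:-2]
    termination_channel = rows[-2]
    repetition_channel = rows[-1]
    return data_rows, termination_channel, repetition_channel


# Observation at iteration 9999: one row per channel, one column per time step.
observation = [
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0],  # termination channel
    [0, 0, 1, 0, 0, 0, 0, 0, 0],  # repetition channel
]

bits, termination, repetition = split_observation(observation)

# The pattern to copy is the column of data bits at time step 1.
pattern = [row[1] for row in bits]
num_repeats = max(repetition)
print("pattern:", pattern, "repeats:", num_repeats)
```

Only `bits` holds the array to be copied; the two remaining rows carry the control information for the controller.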

#### Comments
The noisy version converges more slowly towards zero error. We should train for more iterations to see how much it improves and whether this has a noticeable effect on the copy task. Indeed, so far, even with some noise, the DNC was able to copy the vector without errors in the observed iterations. Further, we should investigate different values of the noise and different parameters of the task (such as increasing the size of the array).
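
To put a number on the slower convergence, the sketch below compares the average training losses logged for the two runs, using only the values visible in the outputs above (up to iteration 54999, where both logs are cut off). The variable names are chosen for this illustration.

```python
# Average training losses copied from the two training logs above.
iterations = [4999, 9999, 14999, 19999, 24999, 29999,
              34999, 39999, 44999, 49999, 54999]
clean_losses = [7.938283, 4.396687, 2.364800, 1.378604, 0.773323, 0.333952,
                0.185517, 0.127265, 0.096554, 0.080684, 0.060898]
noisy_losses = [8.309622, 5.732491, 3.877566, 3.107155, 2.822140, 2.637930,
                2.483084, 2.352814, 2.306345, 2.259139, 2.233495]

# Ratio of noisy to clean loss at each reported iteration.
ratios = [noisy / clean for clean, noisy in zip(clean_losses, noisy_losses)]

for iteration, clean, noisy, ratio in zip(iterations, clean_losses,
                                          noisy_losses, ratios):
    print("iter %6d  clean %.3f  noisy %.3f  ratio %.1f"
          % (iteration, clean, noisy, ratio))
```

The ratio grows at every report: the clean loss keeps shrinking while the noisy loss flattens out slightly above 2, which is why longer training runs are needed before drawing conclusions.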