# TensorFlow Pipeline

The pipeline loads the data in batches, or small chunks. Each batch is pushed through the pipeline and made ready for training. Building a pipeline is a good solution because it allows you to use parallel computing: TensorFlow can spread the work across multiple CPUs. This speeds up the computation and makes it practical to train powerful neural networks.

# Steps to create a pipeline

## Load the data

In [3]:
# Here we use NumPy to generate arbitrary data
import numpy as np

x_input = np.random.sample((3,4)) #data dimension is 3x4
print(x_input)

[[0.62460588 0.7553646  0.88575139 0.55813575]
 [0.47794668 0.94377006 0.63552177 0.56810112]
 [0.86221862 0.73088896 0.36357698 0.73940885]]


## Create placeholders

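Import TensorFlow through the `compat.v1` API and disable eager execution, so that the graph-and-session style used in the rest of this notebook works. In this notebook the NumPy array is passed to the dataset directly, so no explicit placeholder is created.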

In [1]:
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

## Define the dataset

<b>Note: It is strongly advised not to use feed_dict to provide data while running a session.</b><br>

This is where `dataset` and `iterator` come into the picture. Datasets can be generated from multiple types of data sources, such as NumPy arrays, TFRecords, text files, CSV files, etc.<br>


A dataset can be created in multiple ways:
<ul>
    <li><b><u>.from_tensor_slices()</u></b>: This method accepts one or more NumPy arrays or Tensors and slices them along the zeroth dimension, so each slice becomes one element of the dataset. If you are feeding multiple objects, pass them as a tuple and make sure that all the objects have the same size in the zeroth dimension.</li>
    <li><b><u>.from_tensors()</u></b>: Just like from_tensor_slices, this method also accepts one or more NumPy arrays or Tensors. However, it does not slice the input: all the data is emitted at once as a single element. As a result, if you are passing multiple objects, they can have different sizes in the zeroth dimension. This method is useful in cases where the dataset is very small or your learning model needs all the data at once.</li>
    <li><b><u>.from_generator()</u></b>: In this method, a generator function is passed as input. This method is useful when you wish to generate the data at runtime and no raw data exists, or when your training data is so large that it is not possible to store it on disk. I would strongly encourage people to <b>not use</b> this method for the purpose of generating data augmentations.</li>
</ul>
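
The difference between the three constructors is easiest to see from the element shapes they produce. The following is a minimal sketch, assuming TensorFlow is imported through `compat.v1` as elsewhere in this notebook; the array `data` and the generator `row_generator` are illustrative names that do not appear in the original notebook.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Illustrative 3x4 array (an assumption for this sketch; the values are arbitrary).
data = np.arange(12, dtype=np.float64).reshape(3, 4)

# from_tensor_slices: sliced along the zeroth dimension, so each element is one row.
sliced = tf.data.Dataset.from_tensor_slices(data)

# from_tensors: the whole array becomes a single element.
whole = tf.data.Dataset.from_tensors(data)


def row_generator():
    """Hypothetical generator that yields one row at a time."""
    for row in data:
        yield row


# from_generator: elements are produced by the Python generator at run time.
generated = tf.data.Dataset.from_generator(
    row_generator,
    output_types=tf.float64,
    output_shapes=tf.TensorShape([4]),
)

# Static shape of a single element of each dataset.
sliced_shape = tf.data.get_output_shapes(sliced).as_list()
whole_shape = tf.data.get_output_shapes(whole).as_list()
generated_shape = tf.data.get_output_shapes(generated).as_list()

print(sliced_shape, whole_shape, generated_shape)
```

The sliced and generated datasets yield rows of four values, while the `from_tensors` dataset yields the full 3x4 array as one element.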

<p>Once the dataset has been created we can apply different kinds of transformations like batch, repeat, shuffle, map or filter.</p>
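
As a rough illustration of chaining transformations, the sketch below applies `map`, `filter` and `batch` to a small dataset and collects the resulting batches. The 5x4 array and the particular lambda functions are assumptions chosen for this example, not part of the original notebook.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Illustrative 5x4 array (an assumption for this sketch).
data = np.arange(20, dtype=np.float64).reshape(5, 4)

dataset = (
    tf.data.Dataset.from_tensor_slices(data)
    .map(lambda row: row * 2.0)       # double every value in each row
    .filter(lambda row: row[0] > 0.0)  # drop rows whose first value is 0
    .batch(3)                          # group the remaining rows in batches of up to 3
)

iterator = tf.data.make_initializable_iterator(dataset)
next_batch = iterator.get_next()

batches = []
with tf.Session() as sess:
    sess.run(iterator.initializer)
    try:
        while True:
            batches.append(sess.run(next_batch))
    except tf.errors.OutOfRangeError:
        pass  # the dataset is exhausted

for batch in batches:
    print(batch.shape)
```

Four rows survive the filter, so the last batch is smaller than the requested batch size. `shuffle` and `repeat` can be chained in the same way.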


Here we define the dataset from the NumPy array `x_input` using the method `tf.data.Dataset.from_tensor_slices`, which slices the 3x4 array along its zeroth dimension into three elements of four values each.<br>

In [4]:
dataset = tf.data.Dataset.from_tensor_slices(x_input)

## Create the pipeline

We need to initialize the pipeline where the data will flow. We create an iterator with `make_initializable_iterator` and name it `iterator`. We then call the iterator's `get_next` method to obtain the operation that yields the next element, and name it `get_next`. Note that in our example the dataset is not batched, so each run of `get_next` returns one row of four values.<br>

TensorFlow provides four types of iterators, each with a specific purpose and use case.
<ul>
    <li><b><u>one_shot_iterator</u></b>: A one-shot iterator iterates through all the elements present in the dataset and, once exhausted, cannot be used anymore.</li>
    <li><b><u>initializable</u></b>: With a one-shot iterator, the same training dataset could end up repeated in memory, and there was no way to periodically validate the model on a validation dataset. The initializable iterator overcomes these problems. It has to be initialized with the dataset before it starts running.</li>
    <li><b><u>reinitializable</u></b>: With an initializable iterator, different datasets have to go through the same pipeline before being fed into the iterator. The reinitializable iterator overcomes this problem, because it can be initialized from different datasets that have each gone through their own pipeline. The only care to be taken is that the datasets have the same data types and compatible shapes.</li>
    <li><b><u>feedable</u></b>: The reinitializable iterator gives the flexibility of assigning differently pipelined datasets to one iterator, but it cannot maintain state (i.e., how far each dataset has been consumed) when switching between them. The feedable iterator addresses this by selecting among several iterators through a handle fed at run time.</li>
</ul>
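
To make the reinitializable case concrete, here is a minimal sketch, assuming two small illustrative arrays (`train_data` and `val_data`, which are not part of the original notebook). One iterator is created from the element structure alone and then initialized from either dataset; the helper `drain` is a hypothetical name introduced for this example.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Illustrative training and validation arrays (assumptions for this sketch).
train_data = np.arange(6, dtype=np.float64).reshape(3, 2)
val_data = np.arange(100, 104, dtype=np.float64).reshape(2, 2)

train_ds = tf.data.Dataset.from_tensor_slices(train_data)
val_ds = tf.data.Dataset.from_tensor_slices(val_data)

# The iterator is defined by element type and shape only, not by a specific dataset.
iterator = tf.data.Iterator.from_structure(
    tf.data.get_output_types(train_ds),
    tf.data.get_output_shapes(train_ds),
)
next_row = iterator.get_next()

# One initializer op per dataset; running one points the iterator at that dataset.
train_init = iterator.make_initializer(train_ds)
val_init = iterator.make_initializer(val_ds)


def drain(sess, init_op):
    """Initialize the iterator with init_op and collect every row it yields."""
    sess.run(init_op)
    rows = []
    try:
        while True:
            rows.append(sess.run(next_row))
    except tf.errors.OutOfRangeError:
        pass  # the current dataset is exhausted
    return rows


with tf.Session() as sess:
    train_rows = drain(sess, train_init)
    val_rows = drain(sess, val_init)

print(len(train_rows), len(val_rows))
```

Both datasets here share the same element type and shape, which is what allows a single iterator to serve them.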

Regardless of the type of iterator, its `get_next` function creates an operation in your TensorFlow graph which, when run in a session, returns the next values from the dataset behind the iterator. Also, the iterator does not keep track of how many elements are present in the dataset. Hence, it is normal to keep running the iterator's `get_next` operation until TensorFlow raises a `tf.errors.OutOfRangeError` exception.

In [5]:
iterator = tf.data.make_initializable_iterator(dataset)
get_next = iterator.get_next()
print(get_next)

Tensor("IteratorGetNext:0", shape=(4,), dtype=float64)


## Execute the Operation

We start a session and run the iterator's initializer. No `feed_dict` is needed, because the dataset was built directly from the NumPy array. We then run `get_next` repeatedly to print each row, until `tf.errors.OutOfRangeError` signals that the dataset is exhausted.

In [6]:
with tf.Session() as sess:
    sess.run(iterator.initializer)
    try:
        while True:
            print(sess.run(get_next))
    except tf.errors.OutOfRangeError:
        print('---Finished Execution---')

[0.62460588 0.7553646  0.88575139 0.55813575]
[0.47794668 0.94377006 0.63552177 0.56810112]
[0.86221862 0.73088896 0.36357698 0.73940885]
---Finished Execution---
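
An initializable iterator can also be combined with a placeholder, so that the array is supplied once, when the iterator is initialized, rather than on every training step. The following is a sketch of that variant under the same `compat.v1` setup; the placeholder name `x` and its shape `[None, 4]` are assumptions made for this example.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

x_input = np.random.sample((3, 4))  # data dimension is 3x4

# Placeholder standing in for the data; None leaves the number of rows open.
x = tf.placeholder(tf.float64, shape=[None, 4])

dataset = tf.data.Dataset.from_tensor_slices(x)
iterator = tf.data.make_initializable_iterator(dataset)
get_next = iterator.get_next()

rows = []
with tf.Session() as sess:
    # The array is fed only here, at initialization time.
    sess.run(iterator.initializer, feed_dict={x: x_input})
    try:
        while True:
            rows.append(sess.run(get_next))
    except tf.errors.OutOfRangeError:
        pass  # the dataset is exhausted

print(len(rows))
```

The same graph can then be initialized again with a different array of four columns without rebuilding the dataset.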
