## TensorFlow Dataset API

In this notebook I aim to get a better understanding of how `tf.data.Dataset` works compared with feeding data through `feed_dict`. A dataset pipeline is designed to keep the GPU supplied with data, so that it spends less time waiting for new input.

Generally, there are 3 steps to follow:
1. **Import the data**: create a dataset instance from some data
2. **Create an iterator**: make an iterator from the created dataset
3. **Consume the dataset**: use the iterator to get the data that we need for our models


In [1]:
import tensorflow as tf
import numpy as np



### Import the data

Below are ways to import data into TensorFlow from different sources.

**Numpy**

You can pass either a single NumPy array or several.

In [8]:
# single random array of shape (50, 2)
x = np.random.sample((50,2))

#make dataset
dataset = tf.data.Dataset.from_tensor_slices(x)

In [9]:
dataset

<TensorSliceDataset shapes: (2,), types: tf.float64>

In [10]:
it = dataset.make_one_shot_iterator()
el = it.get_next()

with tf.Session() as sess:
    print(sess.run(el))

[0.04755973 0.46445782]


Multiple NumPy arrays

In [11]:
features, labels = (np.random.sample((100,2)), np.random.sample((100,1)))
dataset = tf.data.Dataset.from_tensor_slices((features,labels))

In [12]:
it = dataset.make_one_shot_iterator()
el = it.get_next()

with tf.Session() as sess:
    print(sess.run(el))

(array([0.00995442, 0.8160845 ]), array([0.37720206]))
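
To make the slicing behaviour concrete, here is a plain-Python analogy (not TensorFlow code). `slice_rows` is a made-up helper that only mimics what `from_tensor_slices` does in the two cells above: it cuts along the first axis, and when several arrays are passed it pairs up matching rows.

```python
# Plain-Python analogy for from_tensor_slices; slice_rows is a hypothetical helper.
def slice_rows(*arrays):
    """Yield one element per row, pairing rows across all inputs."""
    lengths = {len(a) for a in arrays}
    if len(lengths) != 1:
        raise ValueError("all inputs must share the same first dimension")
    if len(arrays) == 1:
        # A single input yields bare rows, like the single-array case.
        yield from arrays[0]
    else:
        # Several inputs yield tuples of matching rows.
        yield from zip(*arrays)


features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # shape (3, 2)
labels = [[1.0], [0.0], [1.0]]                    # shape (3, 1)

single = list(slice_rows(features))
paired = list(slice_rows(features, labels))
print(single[0])  # one row of features
print(paired[0])  # a (features_row, label_row) tuple
```

This is why the second cell returns a tuple of two arrays per element, while the first returns a single array.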


**Tensors**

In [16]:
tensor = tf.random_uniform([100,2])
dataset = tf.data.Dataset.from_tensor_slices(tensor)

In [19]:
it = dataset.make_initializable_iterator()
el = it.get_next()

with tf.Session() as sess:
    #need to initialize iterator
    sess.run(it.initializer) #important!!!!!
    print(sess.run(el))

[0.2864604 0.7926196]


As shown above, an initializable iterator must be initialized by running `it.initializer` before the first `get_next()` result is requested. **ALWAYS REMEMBER THIS**

**Placeholder**

Useful when you want to dynamically change the data inside a dataset.

In [21]:
x = tf.placeholder(tf.float32, shape=[None,2])
dataset = tf.data.Dataset.from_tensor_slices(x)

data = np.random.sample((100,2))

it = dataset.make_initializable_iterator()
el = it.get_next()

with tf.Session() as sess:
    sess.run(it.initializer, feed_dict={x: data})
    print(sess.run(el))
    

[0.95630896 0.69142455]


**Generator**

Useful when the elements have different lengths, such as sequences.

In [25]:
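# ragged data: kept as a plain Python list, because recent NumPy versions
# refuse to build an array from rows of different lengths unless dtype=object
sequence = [[[1]], [[2],[3]], [[3],[4],[5]]]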

#create generator
def generator():
    for el in sequence:
        yield el

dataset = tf.data.Dataset.from_generator(
    generator,
    output_types=tf.int64,
    output_shapes=tf.TensorShape([None, 1]))
it = dataset.make_initializable_iterator()
el = it.get_next()

with tf.Session() as sess:
    sess.run(it.initializer)
    print(sess.run(el))
    print(sess.run(el))
    print(sess.run(el))

[[1]]
[[2]
 [3]]
[[3]
 [4]
 [5]]
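
The `None` in `tf.TensorShape([None, 1])` is what lets each element have a different length. A small plain-Python check of that idea, where `shape_of` and `is_compatible` are hypothetical helpers and not part of TensorFlow:

```python
def shape_of(nested):
    """Return the shape of a rectangular, non-empty nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


def is_compatible(shape, spec):
    """True if shape matches spec, where None in spec accepts any size."""
    return len(shape) == len(spec) and all(
        want is None or want == got for got, want in zip(shape, spec)
    )


sequence = [[[1]], [[2], [3]], [[3], [4], [5]]]
spec = (None, 1)  # mirrors tf.TensorShape([None, 1])

shapes = [shape_of(el) for el in sequence]
print(shapes)
print(all(is_compatible(s, spec) for s in shapes))
```

Every element has a different first dimension but the same second dimension, so they all fit the declared shape.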


### Create Iterator

There are different types of iterators that we can use to retrieve values from our data.

* **One Shot**: iterates once through the dataset; it needs no initialization and nothing can be fed to it
* **Initializable**: must be initialized before use; data can be fed in dynamically with `feed_dict` when running the initializer
* **Reinitializable**: can be initialized from different `Dataset` objects. Useful when the training dataset undergoes transformations (for example shuffling) that the test dataset does not
* **Feedable**: uses a `tf.placeholder` handle to select which iterator to use
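
As a mental model for the initializable case, here is a toy plain-Python stand-in. `ToyInitializableIterator` is a made-up class and only imitates the behaviour: nothing can be read until the iterator is initialized, and initializing again with other data restarts it on that data.

```python
class ToyInitializableIterator:
    """Toy stand-in for an initializable iterator (hypothetical class)."""

    def __init__(self):
        self._rows = None  # no data until initialize() is called

    def initialize(self, data):
        # Plays the role of sess.run(it.initializer, feed_dict=...).
        self._rows = iter(data)

    def get_next(self):
        if self._rows is None:
            raise RuntimeError("iterator has not been initialized")
        return next(self._rows)


toy_it = ToyInitializableIterator()
try:
    toy_it.get_next()  # reading before initializing fails
except RuntimeError as err:
    print("error:", err)

toy_it.initialize([[0.1, 0.2], [0.3, 0.4]])  # "training" data
first_train = toy_it.get_next()
toy_it.initialize([[1.0, 2.0]])              # switch to "test" data
first_test = toy_it.get_next()
print(first_train, first_test)
```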

**Initializable**

In [30]:
EPOCHS = 10

x = tf.placeholder(tf.float32, shape=[None,2])
y = tf.placeholder(tf.float32, shape=[None,1])
dataset = tf.data.Dataset.from_tensor_slices((x,y))

train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.array([[1,2]]), np.array([[0]]))

it = dataset.make_initializable_iterator()
feats, labels = it.get_next()

with tf.Session() as sess:
    sess.run(it.initializer, feed_dict={x:train_data[0], y:train_data[1]})
    for i in range(EPOCHS):
        sess.run([feats, labels])
    #switch to test
    sess.run(it.initializer, feed_dict={x:test_data[0], y:test_data[1]})
    print(sess.run([feats, labels]))
    

[array([1., 2.], dtype=float32), array([0.], dtype=float32)]


**Reinitializable**

We are switching *between DATASETS*

In [31]:
EPOCHS=10

#train_test data
train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.random.sample((10,2)), np.random.sample((10,1)))

#create datasets
train_dataset = tf.data.Dataset.from_tensor_slices(train_data)
test_dataset = tf.data.Dataset.from_tensor_slices(test_data)

# create a generic iterator, defined only by the element types and shapes
it = tf.data.Iterator.from_structure(train_dataset.output_types, train_dataset.output_shapes)
feats, labels = it.get_next()
# create the initialization operations
train_init_op = it.make_initializer(train_dataset)
test_init_op = it.make_initializer(test_dataset)

with tf.Session() as sess:
    sess.run(train_init_op) #run train initializer
    for _ in range(EPOCHS):
        sess.run([feats, labels])
    #switch to test
    sess.run(test_init_op)
    print(sess.run([feats,labels]))


[array([0.96644433, 0.79570336]), array([0.74923463])]


**Feedable**

We are switching *between ITERATORS*

In [36]:
EPOCHS = 10

#my data
train_data = (np.random.sample((100,2)), np.random.sample((100,1)))
test_data = (np.random.sample((10,2)), np.random.sample((10,1)))

# placeholders
x = tf.placeholder(tf.float32, shape=[None,2])
y = tf.placeholder(tf.float32, shape=[None,1])

# create datasets
train_dataset = tf.data.Dataset.from_tensor_slices((x,y))
test_dataset = tf.data.Dataset.from_tensor_slices((x,y))

# we could also have used one-shot iterators (in that case no placeholders are needed)
train_it = train_dataset.make_initializable_iterator()
test_it = test_dataset.make_initializable_iterator()

*handle*: a placeholder whose value can be changed at run time. It lets us switch between the iterators defined above.

In [37]:
handle = tf.placeholder(tf.string, shape=[])

# create a generic iterator; feeding `handle` lets us switch between iterators
# (the reinitializable iterator earlier switched between datasets instead)
it = tf.data.Iterator.from_string_handle(handle, train_dataset.output_types, train_dataset.output_shapes)
els = it.get_next()

with tf.Session() as sess:
    train_handle = sess.run(train_it.string_handle())
    test_handle = sess.run(test_it.string_handle())
    
    #initialize the different iterators
    sess.run(train_it.initializer, feed_dict={x: train_data[0], y:train_data[1]})
    sess.run(test_it.initializer, feed_dict={x: test_data[0], y: test_data[1]})
    
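    for _ in range(EPOCHS):
        # use new names so the x and y placeholders are not overwritten
        feats, labels = sess.run(els, feed_dict={handle: train_handle})
        print(feats, labels)
    feats, labels = sess.run(els, feed_dict={handle: test_handle})
    print(feats, labels)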
    

(array([0.48809168, 0.5892124 ], dtype=float32), array([0.71873456], dtype=float32))
(array([0.39999062, 0.8828608 ], dtype=float32), array([0.2170473], dtype=float32))
(array([0.92552376, 0.33964977], dtype=float32), array([0.75877225], dtype=float32))
(array([0.33813456, 0.25078583], dtype=float32), array([0.04937245], dtype=float32))
(array([0.945379  , 0.38982236], dtype=float32), array([0.32936984], dtype=float32))
(array([0.9840039 , 0.91464275], dtype=float32), array([0.61121887], dtype=float32))
(array([0.4283472 , 0.05491492], dtype=float32), array([0.95066553], dtype=float32))
(array([0.9981808 , 0.26911464], dtype=float32), array([0.1444552], dtype=float32))
(array([0.28101063, 0.5318586 ], dtype=float32), array([0.41510203], dtype=float32))
(array([0.0916037, 0.4734865], dtype=float32), array([0.86678237], dtype=float32))
(array([0.9377489, 0.9165799], dtype=float32), array([0.9074531], dtype=float32))
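
One practical difference from the reinitializable iterator: running an initializer restarts from the beginning of the dataset, whereas switching handles leaves each underlying iterator where it was. A toy plain-Python model of that, where the handle strings and the `get_next` helper are made up for illustration:

```python
# Toy model of handle-based switching; the handle strings are made up.
iterators = {
    "train": iter([10, 11, 12, 13]),
    "test": iter([90, 91]),
}


def get_next(handle):
    """Pull the next element from whichever iterator the handle names."""
    return next(iterators[handle])


seen = [
    get_next("train"),  # 10
    get_next("train"),  # 11
    get_next("test"),   # 90
    get_next("train"),  # 12: resumes, does not restart at 10
]
print(seen)
```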


### Consuming the data

The following code uses two new methods:
* `batch`: groups consecutive elements into batches of the given size
* `repeat`: sets how many times the dataset is repeated. With no argument it repeats indefinitely, which is convenient because the number of training steps is then controlled by the number of epochs we choose in the loop
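
To see how the two interact, here is a plain-Python analogy built on `itertools` (not TensorFlow code; the `batch` function is a hypothetical stand-in). It assumes the default behaviour where the last, smaller batch is kept, and it shows that repeating before batching lets a batch run across the boundary between two passes over the data.

```python
from itertools import cycle, islice


def batch(elements, batch_size):
    """Group an iterable into lists of up to batch_size items."""
    iterator = iter(elements)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


data = list(range(10))

# batch only: the final batch is smaller when the sizes do not divide evenly
one_epoch = list(batch(data, 4))
print(one_epoch)

# repeat then batch: the stream never ends, so take just three batches;
# the third batch mixes the end of one pass with the start of the next
repeated = list(islice(batch(cycle(data), 4), 3))
print(repeated)
```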

In [39]:
EPOCHS = 10
BATCH_SIZE = 16
# using two numpy arrays
# Wrapping each array in another array adds a leading axis: features has shape
# (1, 100, 2), so the dataset holds a single element of shape (100, 2).
# repeat().batch(BATCH_SIZE) then stacks BATCH_SIZE copies of that element.
features, labels = (np.array([np.random.sample((100,2))]),
                    np.array([np.random.sample((100,1))]))
dataset = tf.data.Dataset.from_tensor_slices((features,labels)).repeat().batch(BATCH_SIZE)
it = dataset.make_one_shot_iterator() # nothing is fed in; `it` avoids shadowing the built-in iter
x, y = it.get_next()

# make a simple model - a very small neural network
net = tf.layers.dense(x, 8, activation=tf.tanh) # pass the first value from it.get_next() as input
net = tf.layers.dense(net, 8, activation=tf.tanh)
prediction = tf.layers.dense(net, 1, activation=tf.tanh)
# pass the second value from it.get_next() as the label; keyword arguments make the order explicit
loss = tf.losses.mean_squared_error(labels=y, predictions=prediction)
train_op = tf.train.AdamOptimizer().minimize(loss)


with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(EPOCHS):
        _, loss_value = sess.run([train_op, loss])
        print("Iter: {}, Loss: {:.4f}".format(i, loss_value))

Iter: 0, Loss: 0.2334
Iter: 1, Loss: 0.2265
Iter: 2, Loss: 0.2199
Iter: 3, Loss: 0.2134
Iter: 4, Loss: 0.2071
Iter: 5, Loss: 0.2010
Iter: 6, Loss: 0.1952
Iter: 7, Loss: 0.1895
Iter: 8, Loss: 0.1841
Iter: 9, Loss: 0.1788
