# Reading data ... the TensorFlow way 

Queues can be used to feed data into your TensorFlow graph. This allows you to work with large data sets, possibly spread across several files. This notebook is based on the official tutorials
[Reading Data](https://www.tensorflow.org/programmers_guide/reading_data) and
[Threading and Queues](https://www.tensorflow.org/programmers_guide/threading_and_queues), which contain much more information. 

In [None]:
import tensorflow as tf

We start with a queue that holds the names of all files containing the data (here, just a single file):

In [None]:
filename_queue = tf.train.string_input_producer([ "../data/LSDA2017WeedCropTrain.csv" ])

A `TextLineReader` allows us to read from a plain text file:

In [None]:
reader = tf.TextLineReader(skip_header_lines = 0)

The `TextLineReader` then reads from the files in our queue. Each call to `read` yields a key identifying the record and the line itself as a string value:

In [None]:
key, value = reader.read(filename_queue)

Queues are filled in separate threads. These are started by `tf.train.start_queue_runners`.

To manage these threads we use a `tf.train.Coordinator` object, which coordinates the termination of a set of threads. A coordinator can ask all threads it coordinates to stop via `tf.train.Coordinator.request_stop()`. Calling `tf.train.Coordinator.join` waits until the registered threads have terminated.

In [None]:
with tf.Session() as sess:
    coord = tf.train.Coordinator() # Thread coordinator
    threads = tf.train.start_queue_runners(coord = coord) # Start queue runners, returns the corresponding threads
    for i in range(1,10):
        print(sess.run([key, value]))
    coord.request_stop() # Ask the threads to stop
    coord.join(threads)  # Wait until threads have stopped
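
The request-stop-then-join pattern is not specific to TensorFlow. The following is a minimal plain-Python sketch of the same idea, not TensorFlow's implementation: a `threading.Event` stands in for the coordinator's stop request, and a bounded `queue.Queue` stands in for the TensorFlow queue. The `producer` function and the records `"a"`, `"b"`, `"c"` are made up for illustration.

```python
import queue
import threading

def producer(q, stop_event, items):
    """Cycle over items and enqueue them until a stop is requested."""
    i = 0
    while not stop_event.is_set():
        try:
            # A timeout lets the thread re-check the stop flag regularly
            q.put(items[i % len(items)], timeout=0.05)
            i += 1
        except queue.Full:
            continue

q = queue.Queue(maxsize=4)       # bounded queue, comparable to a queue with a capacity
stop_event = threading.Event()   # stands in for the coordinator's stop request
threads = [threading.Thread(target=producer, args=(q, stop_event, ["a", "b", "c"]))]
for t in threads:
    t.start()

consumed = [q.get() for _ in range(9)]  # the consumer dequeues nine records
stop_event.set()                        # analogous to coord.request_stop()
for t in threads:
    t.join()                            # analogous to coord.join(threads)
print(consumed)
```

The producer only notices the stop request between two enqueue attempts, which is why it must never block indefinitely on a full queue.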

After the last pattern is read, the first pattern is returned again, because by default the filename queue keeps cycling through its files indefinitely.
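
This wrap-around comes from the filename queue: `tf.train.string_input_producer` takes a `num_epochs` argument, and with the default `None` it keeps producing the file names over and over. The generator below is a rough plain-Python sketch of that behaviour, not the library's code. The function `filename_producer` and the file names are hypothetical.

```python
import itertools
import random

def filename_producer(filenames, num_epochs=None, shuffle=True, seed=None):
    """Yield file names epoch by epoch; num_epochs=None means cycle forever."""
    rng = random.Random(seed)
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        names = list(filenames)
        if shuffle:
            rng.shuffle(names)  # a new order in every epoch
        yield from names

# Hypothetical file names, for illustration only
files = ["part-0.csv", "part-1.csv"]

endless = filename_producer(files, shuffle=False)
first_five = list(itertools.islice(endless, 5))
print(first_five)   # the list of files wraps around

two_epochs = list(filename_producer(files, num_epochs=2, shuffle=False))
print(two_epochs)   # stops after two passes
```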

The `tf.TextLineReader` returns strings. Now we convert these strings into numerical data.

In [None]:
d = 13 # input dimension
record_defaults = [[1.0] for _ in range(d)] # define all input features to be floats
record_defaults.append([1]) # add the label as an integer
print(record_defaults)

In [None]:
content = tf.decode_csv(value, record_defaults)
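
The role of `record_defaults` can be illustrated without TensorFlow: each entry fixes the type of its column and supplies the value used when the field is empty. The helper `decode_csv_line` below is a hypothetical, simplified stand-in for `tf.decode_csv`. It assumes every default holds exactly one value and it does not handle quoted fields.

```python
def decode_csv_line(line, record_defaults, field_delim=","):
    """Convert one CSV line into typed values, column by column."""
    fields = line.split(field_delim)
    if len(fields) != len(record_defaults):
        raise ValueError("expected %d fields, got %d"
                         % (len(record_defaults), len(fields)))
    values = []
    for field, (default,) in zip(fields, record_defaults):
        # An empty field falls back to the default; otherwise the field
        # is cast to the type of the default
        values.append(default if field == "" else type(default)(field))
    return values

# Two float features and an integer label (a shortened version of the layout above)
example_defaults = [[1.0], [1.0], [1]]
row = decode_csv_line("0.5,,3", example_defaults)
print(row)
```

Here the second field is empty, so it is replaced by its default `1.0`, and the last field is parsed as an integer because its default is an integer.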

In [None]:
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord = coord)
    for i in range(1,10):
        print(sess.run(content))
    coord.request_stop()
    coord.join(threads)

Now we split the content into features and label:

In [None]:
content = tf.decode_csv(value, record_defaults)
# pack all d features into a tensor
features = tf.stack(content[:d])
# assign the last column to label
label = content[-1]

In [None]:
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord = coord)
    for i in range(1,20):
        print(sess.run([features,label]))
    coord.request_stop()
    coord.join(threads)

Now we group the data into mini-batches:

In [None]:
BatchSize = 2
# Minimum number of elements left in the queue after a dequeue, used to ensure that the samples are sufficiently mixed
# 10 times the BatchSize should be sufficient
min_after_dequeue = 10 * BatchSize
# Maximum number of elements in the queue
capacity = 20 * BatchSize
# Shuffle the data and draw mini-batches of BatchSize (features, label) pairs
data_batch = tf.train.shuffle_batch([features,label], batch_size = BatchSize, capacity = capacity,
                                   min_after_dequeue = min_after_dequeue)
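
To see how `capacity` and `min_after_dequeue` interact, consider the following plain-Python sketch of a shuffle buffer. It is an illustration under simplifying assumptions, not the implementation behind `tf.train.shuffle_batch`: filling and dequeuing alternate in a single thread, the input is finite, and the function `shuffle_batches` is hypothetical.

```python
import random

def shuffle_batches(records, batch_size, capacity, min_after_dequeue, seed=None):
    """Yield shuffled batches drawn from a bounded buffer of records."""
    rng = random.Random(seed)
    records = iter(records)
    buffer, batch = [], []
    while True:
        # Producer side: fill the buffer up to its capacity
        filled = False
        for record in records:
            buffer.append(record)
            if len(buffer) >= capacity:
                filled = True
                break
        # While input remains, keep min_after_dequeue elements for mixing;
        # once the input is exhausted, drain the buffer completely
        keep = min_after_dequeue if filled else 0
        while len(buffer) > keep:
            batch.append(buffer.pop(rng.randrange(len(buffer))))
            if len(batch) == batch_size:
                yield batch
                batch = []
        if not filled:
            break
    if batch:
        yield batch  # choice made in this sketch: emit a smaller final batch

batch_size = 2
batches = list(shuffle_batches(range(50), batch_size,
                               capacity=20 * batch_size,
                               min_after_dequeue=10 * batch_size, seed=0))
print(batches[:3])
```

A larger `min_after_dequeue` means each element is drawn from a bigger pool and the data is mixed better, at the cost of more memory and a longer start-up while the buffer fills.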

In [None]:
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord = coord)
    for i in range(1,100):
        print(sess.run(data_batch))
    coord.request_stop()
    coord.join(threads)
