# Building & reading data sets
Since the [FFBP package](https://github.com/alex-ten/pdpyflow/tree/master/FFBP) runs on [TensorFlow](https://www.tensorflow.org/), input data must be made readable for the underlying computational graph. FFBP provides a class called `InputData` that converts a csv data file into a structure that can be fed into a TensorFlow graph.

The csv file must follow a specific layout (see the example below). The first row is reserved for column labels; it is there for convenience and is effectively ignored. Each subsequent row contains one labeled pattern: the first entry is the pattern label, and the remaining entries encode the input values followed by the target values. The example below is taken from the training data for this tutorial's 8-3-8 auto-encoder network (spaces were added for readability):

```
inp_label,  input,             target
p1,         1,0,0,0,0,0,0,0,   1,0,0,0,0,0,0,0
p2,         0,1,0,0,0,0,0,0,   0,1,0,0,0,0,0,0
p3,         0,0,1,0,0,0,0,0,   0,0,1,0,0,0,0,0
p4,         0,0,0,1,0,0,0,0,   0,0,0,1,0,0,0,0
p5,         0,0,0,0,1,0,0,0,   0,0,0,0,1,0,0,0
p6,         0,0,0,0,0,1,0,0,   0,0,0,0,0,1,0,0
p7,         0,0,0,0,0,0,1,0,   0,0,0,0,0,0,1,0
p8,         0,0,0,0,0,0,0,1,   0,0,0,0,0,0,0,1
```
**Importantly, each row must end with an implicit newline character "`\n`", not a comma "`,`"**.
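
To make the layout concrete, here is a minimal sketch that writes the 8-pattern auto-encoder data in the format described above and then checks each row against it. It uses only Python's standard library. The helper names `write_autoencoder_csv` and `parse_pattern_row` are hypothetical and are not part of FFBP, and FFBP's own parsing may well work differently.

```python
import csv
import io


def write_autoencoder_csv(stream, n_patterns=8):
    """Write one-hot auto-encoder patterns in the layout described above."""
    # lineterminator='\n' makes every row end with a newline, never a comma.
    writer = csv.writer(stream, lineterminator='\n')
    # Header row: present for convenience only.
    writer.writerow(['inp_label', 'input', 'target'])
    for i in range(n_patterns):
        one_hot = [1 if j == i else 0 for j in range(n_patterns)]
        # Pattern label first, then the input entries, then the target entries.
        writer.writerow(['p{}'.format(i + 1)] + one_hot + one_hot)


def parse_pattern_row(line, inp_size, targ_size):
    """Split one data row into (label, input values, target values)."""
    entries = [e.strip() for e in line.rstrip('\n').split(',')]
    expected = 1 + inp_size + targ_size
    if len(entries) != expected:
        # A trailing comma yields an extra empty entry and is caught here.
        raise ValueError(
            'expected {} entries, got {}'.format(expected, len(entries)))
    label = entries[0]
    inputs = [float(e) for e in entries[1:1 + inp_size]]
    targets = [float(e) for e in entries[1 + inp_size:]]
    return label, inputs, targets


# Write to an in-memory buffer, then read the data rows back (skip the header).
buffer = io.StringIO()
write_autoencoder_csv(buffer)
lines = buffer.getvalue().splitlines()
rows = [parse_pattern_row(line, inp_size=8, targ_size=8) for line in lines[1:]]
print(rows[0])
```

Replacing `io.StringIO()` with an open file handle would produce a file on disk; the spaces shown in the example above are omitted because they were added only for readability.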

# Reading csv data file
It is straightforward to create an `InputData` instance from a correctly organized data file. You need to specify a few parameters to make it work:
- **`num_epochs`** : the number of epochs you are planning to train/test the model for. An epoch is a single iteration inside which a network processes the entire data set (typically seeing each pattern exactly once).
- **`batch_size`** : the size of a subset (mini-batch) of the data set. Together with `data_len`, this parameter determines the number of weight updates per epoch of training: within a single epoch, weights are updated `data_len / batch_size` times. For example, if `batch_size == data_len`, the network accumulates gradients over all training patterns and performs a single weight update per epoch. In contrast, if `batch_size == 1`, each training pattern triggers a weight update, making the order of training examples consequential. If `1 < batch_size < data_len`, each weight update is based on several training patterns and there is more than one update per epoch. Because the number of updates per epoch must be a whole number, `batch_size` must divide `data_len` evenly, that is, `data_len % batch_size` must be zero.
- **`data_len`** : the number of input patterns in the data set. Together with `batch_size`, this parameter determines the number of weight updates per epoch of training (see `batch_size`).
- **`inp_size`** : the number of input data points (same as the size of input layer).
- **`targ_size`** : the number of target data points (same as the size of output layer).
- **`shuffle_seed`** : (optional, *default*=`None`) the seed for the random number generator that controls the shuffling of input patterns. If `None`, input patterns are fed in the same order (top to bottom) as they appear in the csv file. If negative (e.g. `shuffle_seed=-1`), the seed is generated at random, so the shuffling order cannot be reproduced.
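
The interplay of `data_len`, `batch_size`, and `shuffle_seed` can be illustrated with a small standalone sketch. The function `epoch_batches` below is hypothetical and only mimics the behavior described in the list above with Python's `random` module; it does not reproduce how FFBP or TensorFlow actually shuffle and batch the data.

```python
import random


def epoch_batches(labels, batch_size, shuffle_seed=None):
    """Yield the mini-batches of one epoch as lists of pattern labels."""
    data_len = len(labels)
    if data_len % batch_size != 0:
        raise ValueError('batch_size must divide data_len')
    order = list(labels)
    if shuffle_seed is not None:
        # Assumed convention, following the text above: a negative seed
        # means "seed at random", so the order is not reproducible.
        seed = shuffle_seed if shuffle_seed >= 0 else None
        random.Random(seed).shuffle(order)
    # One mini-batch corresponds to one weight update.
    for start in range(0, data_len, batch_size):
        yield order[start:start + batch_size]


labels = ['p{}'.format(i) for i in range(1, 9)]  # data_len == 8
for batch_size in (8, 4, 1):
    batches = list(epoch_batches(labels, batch_size))
    print('batch_size={}: {} update(s) per epoch'.format(
        batch_size, len(batches)))
```

With eight patterns, this prints 1, 2, and 8 updates per epoch for batch sizes 8, 4, and 1 respectively, and a batch size such as 3 raises a `ValueError` because it does not divide `data_len`.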

In the example below, we set up training and testing data for a simulated training loop. Note that **for testing, we set `batch_size=DATA_LEN`** to make testing more efficient. This feeds the entire data set in a single batch, but each element of the batch is still processed separately. Also note that we don't want to shuffle testing patterns, so omit the `shuffle_seed` parameter or set it to `None`.

Feel free to change the parameters and observe the effects in the output.

```python
import tensorflow as tf
import FFBP

tf.logging.set_verbosity(tf.logging.ERROR)  # Prevent unwanted logging messages by tensorflow

NUM_EPOCHS = 2
BATCH_SIZE = 4
INP_SIZE = 8
TARG_SIZE = 8
DATA_LEN = 8
SHUFFLE = 1

FFBP_GRAPH = tf.Graph()

with FFBP_GRAPH.as_default():

    # Create data for training
    TRAIN_DATA = FFBP.InputData(
        path_to_data_file='auto_data.txt',
        num_epochs=NUM_EPOCHS,
        batch_size=BATCH_SIZE,
        data_len=DATA_LEN,
        inp_size=INP_SIZE,
        targ_size=TARG_SIZE,
        shuffle_seed=SHUFFLE,
    )

    # Create data for testing (whole data set in one batch, no shuffling)
    TEST_DATA = FFBP.InputData(
        path_to_data_file='auto_data.txt',
        num_epochs=NUM_EPOCHS,
        batch_size=DATA_LEN,
        inp_size=INP_SIZE,
        targ_size=TARG_SIZE,
        data_len=DATA_LEN,
    )

# Simulate training loop
with tf.Session(graph=FFBP_GRAPH) as sess:
    # Initialize variables
    sess.run(tf.local_variables_initializer())
    sess.run(tf.global_variables_initializer())

    # Create coordinator and start queue runners
    coordinator = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coordinator)

    for i in range(NUM_EPOCHS):
        print('\nEPOCH {}:\n'.format(i))

        # Mock-test model in a test mini-loop
        print('  TESTING:')
        test_examples = sess.run(TEST_DATA.examples_batch)
        for j, example in enumerate(zip(*test_examples)):
            print('    testing pattern {}: \'{}\' {}'.format(
                j, example[0].decode('UTF-8'), example[1]))

        # Mock-train model in a train mini-loop
        print('  TRAINING:')
        num_updates = TRAIN_DATA.data_len // TRAIN_DATA.batch_size
        for k in range(num_updates):
            examples_batch = sess.run(TRAIN_DATA.examples_batch)
            print(
                '    processing mini-batch {}/{}: {}'.format(
                    k + 1, num_updates,
                    [x.decode('UTF-8') for x in examples_batch[0]]
                )
            )
            # Print the input patterns of the mini-batch, indented by a tab
            print('\t' + str(examples_batch[1]).replace('\n', '\n\t'))

    # Stop the queue-runner threads and wait for them to finish
    coordinator.request_stop()
    coordinator.join(threads)
```


EPOCH 0:

  TESTING:
    testing pattern 0: 'p1' [ 1.  0.  0.  0.  0.  0.  0.  0.]
    testing pattern 1: 'p2' [ 0.  1.  0.  0.  0.  0.  0.  0.]
    testing pattern 2: 'p3' [ 0.  0.  1.  0.  0.  0.  0.  0.]
    testing pattern 3: 'p4' [ 0.  0.  0.  1.  0.  0.  0.  0.]
    testing pattern 4: 'p5' [ 0.  0.  0.  0.  1.  0.  0.  0.]
    testing pattern 5: 'p6' [ 0.  0.  0.  0.  0.  1.  0.  0.]
    testing pattern 6: 'p7' [ 0.  0.  0.  0.  0.  0.  1.  0.]
    testing pattern 7: 'p8' [ 0.  0.  0.  0.  0.  0.  0.  1.]
  TRAINING:
    processing mini-batch 1/2: ['p8', 'p5', 'p6', 'p3']
	[[ 0.  0.  0.  0.  0.  0.  0.  1.]
	 [ 0.  0.  0.  0.  1.  0.  0.  0.]
	 [ 0.  0.  0.  0.  0.  1.  0.  0.]
	 [ 0.  0.  1.  0.  0.  0.  0.  0.]]
    processing mini-batch 2/2: ['p4', 'p1', 'p7', 'p2']
	[[ 0.  0.  0.  1.  0.  0.  0.  0.]
	 [ 1.  0.  0.  0.  0.  0.  0.  0.]
	 [ 0.  0.  0.  0.  0.  0.  1.  0.]
	 [ 0.  1.  0.  0.  0.  0.  0.  0.]]

EPOCH 1:

  TESTING:
    testing pattern 0: 'p1' [ 1.  0.  0.  0.  