### Loading Libraries

In [1]:
import tensorflow as tf
import numpy as np

In [2]:
num_items = 11
num_list1 = np.arange(num_items)
num_list2 = np.arange(num_items,num_items*2)

In [3]:
num_list1

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [4]:
num_list2

array([11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21])

#### This is how to create datasets, using the from_tensor_slices() method: 

In [5]:
num_list1_dataset = tf.data.Dataset.from_tensor_slices(num_list1)

#### This is how to create an iterator on it using the make_one_shot_iterator() method: 

In [9]:
iterator = tf.compat.v1.data.make_one_shot_iterator(num_list1_dataset)

In [10]:
for item in num_list1_dataset:
    num = iterator.get_next().numpy()
    print(num)

0
1
2
3
4
5
6
7
8
9
10


#### Note that executing this code twice in the same program run will raise an error because we are using a one-shot iterator
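A minimal sketch of the exhaustion behaviour described above, assuming TensorFlow 2 with eager execution. The three-element dataset and the variable names are illustrative only; the point is that a consumed iterator raises `tf.errors.OutOfRangeError`, while the dataset itself can be iterated again.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices([0, 1, 2])
iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)

# Consume all three elements.
values = [int(iterator.get_next().numpy()) for _ in range(3)]

# A further call finds nothing left to return.
try:
    iterator.get_next()
    exhausted = False
except tf.errors.OutOfRangeError:
    exhausted = True

# The dataset is unaffected: iterating it again starts from the beginning.
restarted = [int(x) for x in dataset.as_numpy_iterator()]
print(values, exhausted, restarted)
```

Creating a new iterator (or iterating the dataset directly) is therefore enough to read the data again.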

#### It is also possible to access the data in batches with the batch() method. The first argument is the number of elements to put in each batch; the second, drop_remainder, controls whether a final batch with fewer elements than that is dropped

In [11]:
num_list1_dataset = tf.data.Dataset.from_tensor_slices(num_list1).batch(3,drop_remainder=False)

In [12]:
iterator = tf.compat.v1.data.make_one_shot_iterator(num_list1_dataset)

In [13]:
for item in num_list1_dataset:
    num = iterator.get_next().numpy()
    print(num)

[0 1 2]
[3 4 5]
[6 7 8]
[ 9 10]
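To make the effect of drop_remainder concrete, here is a plain-Python sketch of the grouping that batching performs. The helper `batch` is hypothetical and is not TensorFlow's implementation; it only mirrors the behaviour seen in the output above.

```python
def batch(items, batch_size, drop_remainder=False):
    """Group items into lists of batch_size; optionally drop a short final batch."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the incomplete final batch
    return batches

items = list(range(11))
kept = batch(items, 3, drop_remainder=False)
dropped = batch(items, 3, drop_remainder=True)
print(kept)     # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10]]
print(dropped)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

Dropping the remainder is mainly useful when later stages require every batch to have exactly the same shape.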


#### There is also a zip method, which is useful for presenting features and labels together

In [22]:
dataset1 = [1,2,3,4,5]
dataset2 = ['a','e','i','o','u']
dataset1 = tf.data.Dataset.from_tensor_slices(dataset1)
dataset2 = tf.data.Dataset.from_tensor_slices(dataset2)
zipped_datasets = tf.data.Dataset.zip((dataset1,dataset2))
iterator = tf.compat.v1.data.make_one_shot_iterator(zipped_datasets)
for item in zipped_datasets:
    num = iterator.get_next()
    print(num)

(<tf.Tensor: shape=(), dtype=int32, numpy=1>, <tf.Tensor: shape=(), dtype=string, numpy=b'a'>)
(<tf.Tensor: shape=(), dtype=int32, numpy=2>, <tf.Tensor: shape=(), dtype=string, numpy=b'e'>)
(<tf.Tensor: shape=(), dtype=int32, numpy=3>, <tf.Tensor: shape=(), dtype=string, numpy=b'i'>)
(<tf.Tensor: shape=(), dtype=int32, numpy=4>, <tf.Tensor: shape=(), dtype=string, numpy=b'o'>)
(<tf.Tensor: shape=(), dtype=int32, numpy=5>, <tf.Tensor: shape=(), dtype=string, numpy=b'u'>)
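As a sketch of how zipped elements might be consumed in practice, each element can be unpacked directly into a feature and a label. The names `features`, `labels` and `pairs` are illustrative, and TensorFlow 2 eager execution is assumed.

```python
import tensorflow as tf

features = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5])
labels = tf.data.Dataset.from_tensor_slices(['a', 'e', 'i', 'o', 'u'])

# Each element of the zipped dataset is a (feature, label) tuple of tensors.
pairs = [
    (int(feature.numpy()), label.numpy().decode())  # strings come back as bytes
    for feature, label in tf.data.Dataset.zip((features, labels))
]
print(pairs)
```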


#### We can concatenate two datasets as follows, using the concatenate method

In [25]:
ds1 = tf.data.Dataset.from_tensor_slices([1,2,3,4,5,6,7,8,9,10])
ds2 = tf.data.Dataset.from_tensor_slices([11,12,13,14,15,16,17,18,19,20])
ds3 = ds1.concatenate(ds2)
print(ds3)

<ConcatenateDataset shapes: (), types: tf.int32>


In [30]:
iterator = tf.compat.v1.data.make_one_shot_iterator(ds3)

In [31]:
for i in range(14):
    num = iterator.get_next()
    print(num)

tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(11, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(13, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)


We can also do away with iterators altogether, as shown below

In [32]:
epochs = 2
for e in range(epochs):
    for item in ds3:
        print(item)

tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(11, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(13, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(15, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(17, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)
tf.Tensor(19, shape=(), dtype=int32)
tf.Tensor(20, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shap

## Using comma-separated value (CSV) files with datasets

CSV files are a very popular method of storing data. TensorFlow 2 contains flexible methods for dealing with them. The main class here is

tf.data.experimental.CsvDataset

### CSV Example 1

With the following arguments, our dataset will consist of two items taken from each row of the file given by filename, both of the float type, with the first line of the file ignored and columns 1 and 2 used (column numbering is 0-based):

In [None]:
filename = ['/size_1000.csv']
record_defaults = [tf.float32] * 2 # Two required float columns
dataset = tf.data.experimental.CsvDataset(filename,record_defaults,header = True, select_cols = [1,2])
for item in dataset:
    print(item)
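The combined effect of header = True and select_cols = [1,2] can be sketched with the standard library alone. The CSV text and the helper `read_selected_floats` below are hypothetical stand-ins, not part of TensorFlow; they only imitate skipping the first line and keeping two float columns.

```python
import csv
import io

# Hypothetical file contents: a header row, then an id column and two float columns.
csv_text = "id,height,weight\n0,1.5,60.0\n1,1.8,82.5\n"

def read_selected_floats(text, select_cols, header=True):
    reader = csv.reader(io.StringIO(text))
    if header:
        next(reader)  # skip the first line, as header=True does
    # keep only the requested columns and convert each to float
    return [tuple(float(row[c]) for c in select_cols) for row in reader]

rows = read_selected_floats(csv_text, select_cols=[1, 2])
print(rows)  # [(1.5, 60.0), (1.8, 82.5)]
```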

### CSV Example 2

In this example, and with the following arguments, our dataset will consist of one required float, one optional float with a default value of 0.0, and an int, where there is no header in the CSV file and only columns 1, 2, and 3 are imported:

In [None]:
filename = 'mycsvfile.txt'
record_defaults = [tf.float32,tf.constant([0.0],dtype = tf.float32),tf.int32]
dataset = tf.data.experimental.CsvDataset(filename,record_defaults,header = False, select_cols = [1,2,3])
for item in dataset:
    print(item)
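The distinction between required and optional columns can be illustrated in plain Python. In record_defaults, a bare dtype marks a column as required, while a tensor supplies the default for an optional column. The sentinel `REQUIRED` and the helpers below are hypothetical and only imitate that rule: an empty required field is an error, an empty optional field takes its default.

```python
REQUIRED = object()  # sentinel standing in for a bare dtype such as tf.float32

def parse_field(text, convert, default=REQUIRED):
    if text == "":
        if default is REQUIRED:
            raise ValueError("required field is empty")
        return default  # optional column: fall back to the default
    return convert(text)

# Column layout mirroring the example: required float, optional float (default 0.0), required int.
spec = [(float, REQUIRED), (float, 0.0), (int, REQUIRED)]

def parse_row(fields):
    return tuple(parse_field(f, conv, d) for f, (conv, d) in zip(fields, spec))

complete = parse_row(["1.5", "2.5", "7"])
filled = parse_row(["1.5", "", "7"])
print(complete)  # (1.5, 2.5, 7)
print(filled)    # (1.5, 0.0, 7)
```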

### CSV Example 3

For our final example, our dataset will consist of two required floats and a required string. Note that header is set to False in this code, so the first line of the file is parsed as data rather than skipped:

In [37]:
filename = 'file1.txt'
record_defaults = [tf.float32,tf.float32,tf.string,]
dataset = tf.data.experimental.CsvDataset(filename,record_defaults,header=False)
for item in dataset:
    print(item[0].numpy(), item[1].numpy(), item[2].numpy().decode()) # decode because the string is returned as bytes

## TFRecords

Another popular choice for storing data is the TFRecord format. This is a binary file format. For large files, it is a good choice because binary files take up less disc space, take less time to copy, and can be read very efficiently from the disc. All this can have a significant effect on the efficiency of your data pipeline and, thus, the training time of your model. The format is also optimized in a variety of ways for use with TensorFlow. It is a little complex because data has to be converted into the binary format prior to storage and decoded when read back.

### TFRecord example 1

Because a TFRecord file is a sequence of binary strings, its structure must be specified prior to saving so that it can be properly written and subsequently read back. TensorFlow has two structures for this, tf.train.Example and tf.train.SequenceExample. What you have to do is store each sample of your data in one of these structures, then serialize it, and use tf.io.TFRecordWriter to save it to disk.

In the following example, the float array, data, is converted to the binary format and then saved to disc. A feature is a dictionary containing the data that is passed to tf.train.Example prior to serialization and saving. A more elaborate example of this is shown in TFRecord example 2:

In [1]:
import tensorflow as tf
import numpy as np

In [2]:
data = np.array([10.,11.,12.,13.,14.,15.])

In [None]:
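def npy_to_tfrecords(fname,data):
    writer = tf.io.TFRecordWriter(fname)
    # feature maps each key to a tf.train.Feature holding the values
    feature = {}
    feature['data'] = tf.train.Feature(float_list=tf.train.FloatList(value = data))
    example = tf.train.Example(features=tf.train.Features(feature = feature))
    # serialize the Example to a binary string and write it as one record
    serialized = example.SerializeToString()
    writer.write(serialized)
    writer.close()

# write the array to the file that is read back below
npy_to_tfrecords('./myfile.tfrecords',data)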

The code to read the record back is as follows. A parse_function function is constructed that decodes the dataset read back from the file. This requires a dictionary (keys_to_features) with the same name and structure as the saved data:

In [None]:
dataset = tf.data.TFRecordDataset('./myfile.tfrecords')

def parse_function(example_proto):
    keys_to_features = {'data' : tf.io.FixedLenSequenceFeature([], dtype = tf.float32, allow_missing = True)}
    parsed_features = tf.io.parse_single_example(serialized = example_proto, features = keys_to_features)
    return parsed_features['data']

dataset = dataset.map(parse_function)
iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)

# array is retrieved as one item
item = iterator.get_next()

print(item)
print(item.numpy())
print(item[2].numpy())
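The write and read steps can also be combined into one self-contained round trip. The sketch below is a variant rather than the code above: it assumes TensorFlow 2, writes to a temporary path chosen only for illustration (`roundtrip.tfrecords`), uses the writer as a context manager, and parses with tf.io.FixedLenFeature because the array length is known in advance.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

data = np.array([10., 11., 12., 13., 14., 15.])

# Temporary path used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'roundtrip.tfrecords')

# Write one Example holding the whole array under the key 'data'.
example = tf.train.Example(features=tf.train.Features(feature={
    'data': tf.train.Feature(float_list=tf.train.FloatList(value=data.tolist()))
}))
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

def parse_function(example_proto):
    # The key and dtype must match what was written; the shape is the known length.
    spec = {'data': tf.io.FixedLenFeature([len(data)], dtype=tf.float32)}
    return tf.io.parse_single_example(example_proto, spec)['data']

# The file holds a single record, so the list has one entry.
restored = [item.numpy() for item in tf.data.TFRecordDataset(path).map(parse_function)]
print(restored[0])
```

Note that the values come back as float32, since that is how a FloatList stores them.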

### TFRecord example 2

In this example, we look at a more complicated record structure given by this dictionary

In [55]:
filename = './students.tfrecords'
data = {
    'ID'     : 61553,
    'Name'   : ['Jones','Felicity'],
    'Scores' : [45.6 , 97.2]
}

Using this, we can construct a tf.train.Example class, again using the Feature() method.
Note how we have to encode our strings

In [56]:
ID = tf.train.Feature(int64_list = tf.train.Int64List(value = [data['ID']]))
Name = tf.train.Feature(bytes_list=tf.train.BytesList(value=[n.encode('utf-8') for n in data['Name']]))
Scores = tf.train.Feature(float_list = tf.train.FloatList(value = data['Scores']))

example = tf.train.Example(features = tf.train.Features(feature = {'ID' : ID, 'Name' : Name, 'Scores' : Scores}))

Serializing and writing this record to disc is the same as in TFRecord example 1:

In [57]:
writer = tf.io.TFRecordWriter(filename)
writer.write(example.SerializeToString())
writer.close()

To read this back, we just need to construct our parse_function to reflect the structure of the record

In [58]:
dataset = tf.data.TFRecordDataset(filename)

def parse_function(example_proto):
    keys_to_features = {'ID' : tf.io.FixedLenFeature([],dtype = tf.int64),
                       'Name' : tf.io.VarLenFeature(dtype = tf.string),
                       'Scores' : tf.io.VarLenFeature(dtype = tf.float32)
                       }
    parsed_features = tf.io.parse_single_example(serialized = example_proto,
                                                features = keys_to_features)
    return parsed_features['ID'], parsed_features['Name'], parsed_features['Scores']
    
    

The next step is the same as before

In [59]:
dataset = dataset.map(parse_function)

iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
item = iterator.get_next()

# record is retrieved as one item
print(item)

(<tf.Tensor: shape=(), dtype=int64, numpy=61553>, <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x151c4b6d0>, <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x151cbcb50>)


Now we can extract our data from item. The strings are returned as bytes and must be decoded (the default encoding for decode() in Python 3 is UTF-8). Note also that the strings and the array of floats are returned as sparse tensors; to extract their contents from the record, we use the values attribute of the sparse tensor

In [60]:
print('ID : ',item[0].numpy())
name = item[1].values.numpy()
name1 = name[0].decode()
name2 = name[1].decode('utf8')
print('Name:',name1,",",name2)
print("Scores: ",item[2].values.numpy())

ID :  61553
Name: Jones , Felicity
Scores:  [45.6 97.2]
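If an ordinary dense tensor is preferred to reading the values attribute, tf.sparse.to_dense can convert a sparse tensor. The sketch below builds a small SparseTensor by hand as a stand-in for the parsed 'Scores' feature; the name `sparse_scores` is illustrative.

```python
import tensorflow as tf

# A hand-built SparseTensor standing in for the parsed 'Scores' feature.
sparse_scores = tf.sparse.SparseTensor(
    indices=[[0], [1]], values=[45.6, 97.2], dense_shape=[2])

# Convert to an ordinary dense tensor; missing positions would be filled with 0.
dense_scores = tf.sparse.to_dense(sparse_scores)
print(dense_scores.numpy())
```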


## One-hot encoding

One-hot encoding (OHE) is where a tensor is constructed from the data labels with a 1 in each of the elements corresponding to a label's value, and 0 everywhere else; that is, one of the bits in the tensor is hot (1)

### OHE example 1

In this example, we are converting a decimal value of 5 to a one-hot encoded value of 0000010000 using the tf.one_hot() method. Positions are counted from 0, so the 1 appears at index 5

In [62]:
y = 5
y_train_ohe = tf.one_hot(y,depth = 10).numpy()
print(y,'is',y_train_ohe,'when one-hot encoded with a depth of 10')

5 is [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] when one-hot encoded with a depth of 10
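To make explicit what this encoding computes, here is a plain-Python sketch. The helper `one_hot` is hypothetical and is not the TensorFlow function; it simply places a 1.0 at the position given by the label.

```python
def one_hot(label, depth):
    """Return a list of `depth` floats with 1.0 at index `label` and 0.0 elsewhere."""
    return [1.0 if position == label else 0.0 for position in range(depth)]

encoded = one_hot(5, depth=10)
print(encoded)  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```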


### OHE example 2

This is also nicely shown in the following example using the sample code that imports from the fashion MNIST dataset.

The original labels are integers from 0 to 9, so, for example, a label of 2 becomes 0010000000 when one-hot encoded, but note the difference between the index and the label stored at that index

In [71]:
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist
tf.compat.v1.enable_eager_execution()

width,height = 28,28
n_classes = 10

In [73]:
# load the dataset
(x_train,y_train),(x_test,y_test) = fashion_mnist.load_data()

split = 5000
# split the training labels into training and validation sets
(y_train,y_valid) = y_train[:split],y_train[split:]

In [74]:
# one hot encode the labels using TensorFlow
# then convert back to numpy for display
y_train_ohe = tf.one_hot(y_train,depth = n_classes).numpy()
y_valid_ohe = tf.one_hot(y_valid,depth = n_classes).numpy()
y_test_ohe = tf.one_hot(y_test,depth = n_classes).numpy()

Show the difference between the original label and the one-hot encoded label

In [77]:
i = 5
print(y_train[i]) # 'ordinary' number value of label at index i=5 is 2

2


In [79]:
# note the difference between the index of 5 and the label at that index which is 2
print(y_train_ohe[i])

[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
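Going the other way, the integer label can be recovered from a one-hot row by finding the position of its largest element (an argmax). The plain-Python helper `decode_one_hot` below is a hypothetical sketch of that idea; with arrays or tensors one would more likely use np.argmax or tf.argmax.

```python
def decode_one_hot(row):
    """Return the index of the largest element, i.e. the original integer label."""
    return max(range(len(row)), key=lambda position: row[position])

encoded = [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.]
label = decode_one_hot(encoded)
print(label)  # 2
```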
