This code accompanies my blog post, where I go into more detail about the decisions I made, along with any other notes.

The link to the [blogpost](https://ianqs.github.io/blog/2019/01/05/TF-dataset-madness)

The dataset is the [UCI Covertype dataset](https://archive.ics.uci.edu/ml/datasets/covertype).

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import tensorflow as tf

#tf.enable_eager_execution()
import numpy as np
import os
import datetime
import tqdm
import sys
import pprint

# 1) Training Pipeline

## 1.1) Producer:

- Ideally takes arbitrary datasets (NumPy arrays, CSV, `.data` files)

- $\lambda$: (x) -> `TFRecord` file

- In this tutorial, loads from the `unprocessed_data` folder and writes to the `processed_data` folder

## 1.2) Provider:

- loads from the `processed_data` folder

- processes the data (so that the processing is part of the computation graph)

- reads the `TFRecord` files and feeds them directly into TensorFlow, avoiding `feed_dict`, which is [slow](https://www.tensorflow.org/guide/performance/overview#input_pipeline)

# Producer:

## Method 1

1: Load the data

2: Convert each row of data into a format compatible with tf

3: Save it as a TFRecord file
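
Step 2 comes down to wrapping each value in a `tf.train.Feature`. The sketch below shows what helpers for that could look like. It borrows the `_float_feature`, `_int64_feature` and `_bytes_feature` names that `FeatureProto` calls later in this notebook, but their real definitions are not shown here, so the bodies (and the use of the second argument as a length check) are assumptions.

```python
import tensorflow as tf


def _float_feature(values, shape):
    # Hypothetical helper: wrap `shape` numbers in a float_list Feature.
    assert len(values) == shape
    return tf.train.Feature(
        float_list=tf.train.FloatList(value=[float(v) for v in values]))


def _int64_feature(values, shape):
    # Hypothetical helper: wrap `shape` integers in an int64_list Feature.
    assert len(values) == shape
    return tf.train.Feature(
        int64_list=tf.train.Int64List(value=[int(v) for v in values]))


def _bytes_feature(values, shape):
    # Hypothetical helper: wrap `shape` byte strings in a bytes_list Feature.
    assert len(values) == shape
    encoded = [v if isinstance(v, bytes) else str(v).encode('utf-8') for v in values]
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=encoded))


# Build one Example from two features and round-trip it through serialization.
feature = {
    'Elevation': _float_feature([2596], 1),
    'Wilderness_Area': _float_feature([1, 0, 0, 0], 4),
}
example = tf.train.Example(features=tf.train.Features(feature=feature))
serialized = example.SerializeToString()

restored = tf.train.Example.FromString(serialized)
elevation = list(restored.features.feature['Elevation'].float_list.value)
wilderness = list(restored.features.feature['Wilderness_Area'].float_list.value)
```

The serialized bytes are what gets written to the TFRecord file, one record per row.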

## Method 2 (not shown here)

You can load the data as a `tf.data.Dataset`, then use the `tf.data.experimental` module to write a TFRecord file. A good resource for this is the [official docs](https://www.tensorflow.org/tutorials/load_data/tf-records#tfexample).


```
serialized_features_dataset = features_dataset.map(tf_serialize_example)
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(serialized_features_dataset)
```


# Dataset Information

    Elevation                               quantitative    meters                       Elevation in meters
    
    Aspect                                  quantitative    azimuth                      Aspect in degrees azimuth
    
    Slope                                   quantitative    degrees                      Slope in degrees
    
    Horizontal_Distance_To_Hydrology        quantitative    meters                       Horz Dist to nearest surface water features
    
    Vertical_Distance_To_Hydrology          quantitative    meters                       Vert Dist to nearest surface water features
    
    Horizontal_Distance_To_Roadways         quantitative    meters                       Horz Dist to nearest roadway
    
    Hillshade_9am                           quantitative    0 to 255 index               Hillshade index at 9am, summer solstice
    
    Hillshade_Noon                          quantitative    0 to 255 index               Hillshade index at noon, summer solstice
    
    Hillshade_3pm                           quantitative    0 to 255 index               Hillshade index at 3pm, summer solstice
    
    Horizontal_Distance_To_Fire_Points      quantitative    meters                       Horz Dist to nearest wildfire ignition points
    
    Wilderness_Area (4 binary columns)      qualitative     0 (absence) or 1 (presence)  Wilderness area designation
    
    Soil_Type (40 binary columns)           qualitative     0 (absence) or 1 (presence)  Soil Type designation
    
    Cover_Type (7 types)                    integer         1 to 7                       Forest Cover Type designation
    

# Producer Pipeline: 

In [21]:
# Prototype class - enables easier feature definition and better for refactoring

from collections import namedtuple


class FeatureProto(object):
    proto = namedtuple('prototype', ['name', 'dtype', 'shape'])
    
    # Reading the data
    features = [
        proto(name='Elevation', dtype=tf.float32, shape=1),
        proto(name='Aspect', dtype=tf.float32, shape=1),
        proto(name='Slope', dtype=tf.float32, shape=1),
        proto(name='Horizontal_Distance_To_Hydrology', dtype=tf.float32, shape=1),
        proto(name='Vertical_Distance_To_Hydrology', dtype=tf.float32, shape=1),
        proto(name='Horizontal_Distance_To_Roadways', dtype=tf.float32, shape=1),
        proto(name='Hillshade_9am', dtype=tf.float32, shape=1),
        proto(name='Hillshade_Noon', dtype=tf.float32, shape=1),
        proto(name='Hillshade_3pm', dtype=tf.float32, shape=1),
        proto(name='Horizontal_Distance_To_Fire_Points', dtype=tf.float32, shape=1),
        proto(name='Wilderness_Area', dtype=tf.float32, shape=4),
        proto(name='Soil_Type', dtype=tf.float32, shape=40),
        proto(name='Cover_Type', dtype=tf.float32, shape=1),
    ]
    
    @property
    def size(self):
        size = 0
        for prototype in self.features:
            size += prototype.shape
        return size
    
    def dataset_creation(self, data):
        idx = 0
        collection = {}
        for prototype in self.features:            
            encoded_feature = self._generate_feature(
                prototype.dtype, prototype.shape, data, idx
            )
            collection[prototype.name] = encoded_feature
            idx += prototype.shape
        return collection
    
    def _generate_feature(self, dtype, shape, data, idx):
        datum = [data[idx]] if shape == 1 else data[idx:idx + shape]
        
        # _float_feature, _int64_feature and _bytes_feature are helper functions
        # that must be defined before this class is used
        if dtype in (tf.float16, tf.float32, tf.float64):
            encoded_feature = _float_feature(datum, shape)
        elif dtype in (tf.int16, tf.int32, tf.int64):
            encoded_feature = _int64_feature(datum, shape)
        elif dtype == tf.string:
            encoded_feature = _bytes_feature(datum, shape)
        else:
            raise NotImplementedError('Unmatched type while generating feature in FeatureProto')
        return encoded_feature
    
    def unpack(self, example_proto):
        features = self._dataset_parsing()
        parsed_features = tf.parse_single_example(example_proto, features)
        # Split the label off from the input features
        labels = parsed_features.pop('Cover_Type')
        # Soil_Type stays a 40-element binary vector
        parsed_features['Soil_Type'] = tf.convert_to_tensor(parsed_features['Soil_Type'])
        # Collapse the 4-element binary Wilderness_Area vector into a single index
        parsed_features['Wilderness_Area'] = tf.cast(
            tf.argmax(parsed_features['Wilderness_Area'], axis=0), dtype=tf.float32
        )
        labels = tf.cast(labels, dtype=tf.int32)
        #labels = tf.one_hot(tf.cast(labels, dtype=tf.uint8), 8, on_value=1, off_value=0, axis=-1)
        return parsed_features, labels
            
        
    def _dataset_parsing(self):
        if hasattr(self, 'parser_proto'):
            return self.parser_proto
        else:
            parser_proto = {}
            for prototype in self.features:
                feat_name = prototype.name
                dtype = prototype.dtype
                shape = prototype.shape
                # Scalars get an empty shape; vectors get a one-element tuple
                parser_proto[feat_name] = tf.FixedLenFeature(() if shape == 1 else (shape,), dtype)
            self.parser_proto = parser_proto
            return self.parser_proto


feature_proto = FeatureProto()
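
To make the index bookkeeping in `dataset_creation` concrete, here is a minimal pure-Python sketch of the same idea: walk the feature list, slice `shape` columns off a flat row, and advance the offset. The `split_row` helper and the `layout` list are hypothetical stand-ins for the class above, and no TensorFlow encoding happens here.

```python
from collections import namedtuple

Proto = namedtuple('Proto', ['name', 'shape'])

# Same column order as FeatureProto.features: 10 scalars, two binary groups, 1 label.
layout = [
    Proto('Elevation', 1),
    Proto('Aspect', 1),
    Proto('Slope', 1),
    Proto('Horizontal_Distance_To_Hydrology', 1),
    Proto('Vertical_Distance_To_Hydrology', 1),
    Proto('Horizontal_Distance_To_Roadways', 1),
    Proto('Hillshade_9am', 1),
    Proto('Hillshade_Noon', 1),
    Proto('Hillshade_3pm', 1),
    Proto('Horizontal_Distance_To_Fire_Points', 1),
    Proto('Wilderness_Area', 4),
    Proto('Soil_Type', 40),
    Proto('Cover_Type', 1),
]


def split_row(row, layout):
    """Slice a flat row into {name: list of values}, following the layout order."""
    idx = 0
    out = {}
    for proto in layout:
        out[proto.name] = list(row[idx:idx + proto.shape])
        idx += proto.shape
    if idx != len(row):
        raise ValueError('row has {} columns, layout expects {}'.format(len(row), idx))
    return out


total_columns = sum(p.shape for p in layout)  # 10 + 4 + 40 + 1
row = list(range(total_columns))  # dummy row: each value equals its column index
parsed = split_row(row, layout)
print(total_columns, parsed['Wilderness_Area'], parsed['Cover_Type'])
```

Because the dummy row stores each column's own index, the printed slices show directly which columns each feature occupies.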

In [22]:
def load_data():
    loaded = np.loadtxt('unprocessed_data/covtype.data', delimiter=',', dtype=int)  # Avoid tf.contrib since we want to get our hands dirty
    print(loaded.shape)
    all_ind = np.arange(0, len(loaded))
    train_ind = all_ind[: int(len(loaded) * 0.8)]
    test_ind = all_ind[int(len(loaded) * 0.8): ]
    
    return loaded, all_ind, train_ind, test_ind


load_data_run = False
if load_data_run:
    loaded, all_ind, train_ind, test_ind = load_data()
else:
    print('Flip load_data_run to load in data from the unprocessed_data folder')

Flip load_data_run to load in data from the unprocessed_data folder


In [23]:
def generate_samples(feature_proto):
    os.makedirs('processed_data', exist_ok=True)  # create the output folder if it is missing
    
    # Round down to the previous hour, e.g. 2019-01-05_19:00:00
    time = str(datetime.datetime.now().replace(microsecond=0,second=0,minute=0)).replace(' ', '_')
    for record_type in [('train', train_ind), ('test', test_ind)]:
        filename = 'processed_data/tf_record_covtype_{}_{}.tfrecord'.format(
            record_type[0],
            time
        )
        with tf.python_io.TFRecordWriter(filename) as writer:
            for i in tqdm.tqdm_notebook(record_type[1]):
                datum = loaded[i, :]
                feature = feature_proto.dataset_creation(datum)
                example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
                writer.write(example_proto.SerializeToString())

        # If you use a storage system (S3 or some other file hosting service), add the export here
        
generate_run = False

if generate_run:
    generate_samples(feature_proto)
else:
    print('Flip generate_run to generate_samples')

Flip generate_run to generate_samples
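
The filename stamp is built by truncating the current time to the hour, so every file written within the same hour shares one stamp. A minimal sketch of that naming scheme, assuming a hypothetical `record_filename` helper that takes the time as an argument so it can be checked against a fixed date:

```python
import datetime


def record_filename(record_type, now):
    # Truncate to the start of the hour, then swap the space for an underscore.
    stamp = str(now.replace(minute=0, second=0, microsecond=0)).replace(' ', '_')
    return 'processed_data/tf_record_covtype_{}_{}.tfrecord'.format(record_type, stamp)


example_time = datetime.datetime(2019, 1, 5, 19, 42, 13, 500)
train_file = record_filename('train', example_time)
test_file = record_filename('test', example_time)
print(train_file)
```

Note that the stamp contains `:`, which Windows does not allow in filenames; a format such as `now.strftime('%Y-%m-%d_%H')` would avoid that.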


# Provider

Cell 1: Initialize the loader

- even though the train and test data both end up in the same `TFRecordDataset`, we pretend that they're two different runs of our pre-processor

- The proto_wrap function is unnecessary here but for the sake of clarity I left it in. In the next tutorial, where I show you how to use the loaded data, we will remove it

Cell 2: Provide

- return an iterator over your dataset, backed either by 1) stored TFRecords, or 2) a `tf.data.Dataset` built from NumPy arrays or CSVs

Cell 3: Iteration

In [25]:
configuration = 'tf'  # Options: tf, csv, np (the csv branch is still a stub)

# Option 1: reading tf.data.TFRecordDataset
#     - requires that you generate it first
if configuration == 'tf':
    filename_list = []
    for dirname, dirnames, filenames in os.walk('processed_data/'):
        for f in filenames:
            if f.endswith('.tfrecord'):  # extension written by generate_samples
                filename_list.append(os.path.join(dirname, f))
    print(filename_list)
    dataset = tf.data.TFRecordDataset(filename_list)
    num_cpus = os.cpu_count()
    # NOTE: dataset_config is a helper that is not defined in this section
    training_dataset_next = dataset_config(filename_list, mapper=feature_proto.unpack, num_cpus=num_cpus)

    
# Option 2: reading as a CSV
elif configuration == 'csv':
    filename_queue = tf.train.string_input_producer(['unprocessed_data/covtype.csv'])
    reader = tf.TextLineReader()
    k, v = reader.read(filename_queue)
    
    record_defaults = [[0] for _ in range(feature_proto.size)]
    
    columns = tf.decode_csv(v, record_defaults=record_defaults)
    """ FILL IN """
    

# Option 3: reading NumPy arrays. Embedding large arrays in the graph runs into the 2GB tf.GraphDef limit, so avoid this for big datasets
else:
    with np.load('unprocessed_data/covtype.npy') as data:
        features = data["features"]
        labels = data["labels"]
        
        
        features_placeholder = tf.placeholder(features.dtype, features.shape)
        labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

        
        training_dataset_next = dataset_config()

['processed_data/tf_record_covtype_train_2019-01-05_19:00:00', 'processed_data/tf_record_covtype_test_2019-01-05_19:00:00']
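
Collecting the record files boils down to walking the output folder and keeping the names that look like TFRecords. A minimal sketch, assuming the `.tfrecord` suffix that `generate_samples` writes and a hypothetical `find_tfrecords` helper; it runs against a temporary folder so it does not depend on `processed_data` existing:

```python
import os
import tempfile


def find_tfrecords(root):
    """Return the sorted paths of all *.tfrecord files under root."""
    matches = []
    for dirname, _, filenames in os.walk(root):
        for f in filenames:
            if f.endswith('.tfrecord'):
                # os.path.join adds the separator that plain string formatting would miss
                matches.append(os.path.join(dirname, f))
    return sorted(matches)


with tempfile.TemporaryDirectory() as root:
    # Create two fake record files and one unrelated file.
    for name in ['train.tfrecord', 'test.tfrecord', 'notes.txt']:
        open(os.path.join(root, name), 'w').close()
    found = [os.path.basename(p) for p in find_tfrecords(root)]

print(found)
```

The resulting list can be passed straight to `tf.data.TFRecordDataset`, which accepts a list of filenames.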


In [27]:
# Lazy execution

init = tf.global_variables_initializer()
with tf.Session() as sess:
    if configuration == 'np':
        # TODO: feed features_placeholder and labels_placeholder when initializing the iterator
        pass

    sess.run(init)
    for i in range(2):
        features, label = sess.run(training_dataset_next)
        pprint.pprint(features.keys())
    
# Eager execution
#features, label = training_dataset_next

Instructions for updating:
To construct input pipelines, use the `tf.data` module.
dict_keys(['Aspect', 'Elevation', 'Hillshade_3pm', 'Hillshade_9am', 'Hillshade_Noon', 'Horizontal_Distance_To_Fire_Points', 'Horizontal_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Slope', 'Soil_Type', 'Vertical_Distance_To_Hydrology', 'Wilderness_Area'])
dict_keys(['Aspect', 'Elevation', 'Hillshade_3pm', 'Hillshade_9am', 'Hillshade_Noon', 'Horizontal_Distance_To_Fire_Points', 'Horizontal_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Slope', 'Soil_Type', 'Vertical_Distance_To_Hydrology', 'Wilderness_Area'])


In [10]:
pprint.pprint(features)

{'Aspect': array([ 51.,  56., 139., 155.,  45., 132.,  45.,  49.,  45.,  59., 201.,
       151., 134., 214., 157.,  51., 259.,  72.,   0.,  38.,  71., 209.,
       114.,  54.,  22., 135., 163., 148., 135., 117., 122., 105.],
      dtype=float32),
 'Elevation': array([2596., 2590., 2804., 2785., 2595., 2579., 2606., 2605., 2617.,
       2612., 2612., 2886., 2742., 2609., 2503., 2495., 2610., 2517.,
       2504., 2503., 2501., 2880., 2768., 2511., 2507., 2492., 2489.,
       2962., 2811., 2739., 2703., 2522.], dtype=float32),
 'Hillshade_3pm': array([148., 151., 135., 122., 150., 140., 138., 144., 133., 124., 161.,
       136.,  92., 170., 151., 137., 161., 133., 156., 144., 126., 179.,
        71., 130., 143., 142., 145., 120., 154.,  71.,  52., 130.],
      dtype=float32),
 'Hillshade_9am': array([221., 220., 234., 238., 220., 230., 222., 222., 223., 228., 218.,
       234., 248., 213., 224., 224., 216., 228., 214., 220., 230., 206.,
       252., 225., 215., 229., 230., 240., 220., 253

In [11]:
pprint.pprint(label)

array([[0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [0,
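
The rows printed above have eight columns with a single 1 each, which matches what the commented-out `tf.one_hot(..., 8, ...)` line in `unpack` would produce: Cover_Type runs from 1 to 7, so index 0 is never set. A minimal pure-Python sketch of that encoding, with a hypothetical `one_hot` helper standing in for `tf.one_hot`:

```python
def one_hot(label, depth=8):
    """Return a list of `depth` zeros with a 1 at position `label`."""
    return [1 if i == label else 0 for i in range(depth)]


# Cover types 5 and 2 are the two labels visible in the printed batch.
encoded = [one_hot(label) for label in (5, 2)]
for row in encoded:
    print(row)
```

A depth of 7 with `label - 1` would drop the unused column, at the cost of shifting every class index by one.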

# Done! 

And with that, we're done! We've

1) taken a non-trivial dataset,
2) converted it into a `TFRecord` file, and
3) shown how to load it back and read from it.