# Introduction to TensorFlow Datasets and Estimators

Datasets and Estimators are two key TensorFlow features:

- Datasets: The best-practice way to create input pipelines, that is, to read data into your graph.
- Estimators: A high-level API to create TensorFlow models. Estimators include canned models (out of the box) and custom estimators.

The diagram below shows the TensorFlow architecture, including the Dataset API and Estimators. Combined, they offer an easy way to create TensorFlow models:

![title](https://3.bp.blogspot.com/-l2UT45WGdyw/Wbe7au1nfwI/AAAAAAAAD1I/GeQcQUUWezIiaFFRCiMILlX2EYdG49C0wCLcBGAs/s1600/image6.png)


# Our Data

The trained model categorizes Iris flowers based on four botanical features (sepal length, sepal width, petal length, and petal width). So, during inference, you can provide values for those four features and the model will predict that the flower is one of the following three beautiful variants:

![title](https://www.tensorflow.org/images/iris_three_species.jpg)

In [1]:
import os

import six.moves.urllib.request as request
import tensorflow as tf

# Record the installed TensorFlow version (this tutorial uses the TF 1.x API)
tf_version = tf.__version__



In [2]:
tf.logging.set_verbosity(tf.logging.INFO)

# Let's get our Data

In [3]:
PATH = "/tmp/tf_dataset_and_estimator_apis"

In [4]:
# Fetch and store Training and Test dataset files
PATH_DATASET = PATH + os.sep + "dataset"
FILE_TRAIN = PATH_DATASET + os.sep + "iris_training.csv"
FILE_TEST = PATH_DATASET + os.sep + "iris_test.csv"
URL_TRAIN = "http://download.tensorflow.org/data/iris_training.csv"
URL_TEST = "http://download.tensorflow.org/data/iris_test.csv"

In [5]:
def downloadDataset(url, file):
    if not os.path.exists(PATH_DATASET):
        os.makedirs(PATH_DATASET)
    if not os.path.exists(file):
        data = request.urlopen(url).read()
        with open(file, "wb") as f:
            f.write(data)  # The with block closes the file automatically
downloadDataset(URL_TRAIN, FILE_TRAIN)
downloadDataset(URL_TEST, FILE_TEST)
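
The download helper above reads the whole response into memory before writing it. As a rough alternative sketch (Python 3 standard library only, not part of the original notebook), the hypothetical `download_if_missing` below streams the response to disk and reports whether it actually fetched anything. The demo uses a local `file://` URL and made-up file names so that it runs without network access.

```python
import shutil
import tempfile
from pathlib import Path
from urllib.request import urlopen


def download_if_missing(url, destination):
    """Fetch url into destination unless the file already exists.

    Returns True if the file was fetched, False if it was already present.
    """
    destination = Path(destination)
    if destination.exists():
        return False
    # Create the target directory (and any parents) on first use.
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of holding it all in memory.
    with urlopen(url) as response, open(destination, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return True


# Demo with a local file:// URL; the paths and the row are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_text("5.9,3.0,4.2,1.5,1\n")
    target = Path(tmp) / "dataset" / "copy.csv"
    first = download_if_missing(source.as_uri(), target)   # fetches the file
    second = download_if_missing(source.as_uri(), target)  # skips, already there
    copied_text = target.read_text()
```

Calling it a second time is a no-op, which matches the intent of the `os.path.exists` checks in the original helper.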

# Specify Metadata

In [6]:
# List of the features
feature_names = [
    'SepalLength', 
    'Sepal_Width', 
    'PetalLength', 
    'PetalWidth']

# Create input function

When we train our model, we need a function that reads the input file and returns the feature and label data. Estimators require an input function of the following format. Its return value must be a two-element tuple organized as follows:

- The first element must be a dict in which each key is an input feature name and each value is a list of that feature's values for the training batch.
- The second element is a list of labels for the training batch.
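
To make the shape of that return value concrete, here is a minimal plain-Python sketch with no TensorFlow involved. The helper name `make_batch` is hypothetical, and the two sample rows simply follow the Iris CSV column order (four features, then the label).

```python
import csv

FEATURE_NAMES = ["SepalLength", "Sepal_Width", "PetalLength", "PetalWidth"]


def make_batch(csv_lines):
    """Build the (features, labels) pair described above from CSV lines."""
    features = {name: [] for name in FEATURE_NAMES}
    labels = []
    for row in csv.reader(csv_lines):
        *values, label = row  # the last column is the label
        for name, value in zip(FEATURE_NAMES, values):
            features[name].append(float(value))
        labels.append(int(label))
    return features, labels


# Two illustrative rows in the Iris column order.
sample_lines = ["5.9,3.0,4.2,1.5,1", "5.1,3.3,1.7,0.5,0"]
features, labels = make_batch(sample_lines)
print(features["PetalLength"])
print(labels)
```

Each dict value and the label list have one entry per example, so position `i` in every list describes the same flower. The real input function returns tensors with this same layout.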

In [7]:
def input_fn(file_path, perform_shuffle=False, repeat_count=1):
    def decode_csv(line):
        # Convert CSV records to tensors. Each column maps to one tensor.
        parsed_line = tf.decode_csv(line, [[0.], [0.], [0.], [0.], [0]])
        label = parsed_line[-1]  # Last element is the label
        del parsed_line[-1]  # Delete last element (label)
        features = parsed_line  # Everything but last elements are the features
        d = dict(zip(feature_names, features)), label
        return d
    
    # A Dataset comprising lines from one or more text files
    dataset = (tf.data.TextLineDataset(file_path)  # Read text file
               .skip(1)  # Skip header row
               .map(decode_csv))  # Transform each elem by applying decode_csv fn
    
    if perform_shuffle:
        # Randomizes input using a window of 256 elements (read into memory)
        dataset = dataset.shuffle(buffer_size=256)
    dataset = dataset.repeat(repeat_count)  # Repeats dataset this # times
    dataset = dataset.batch(32)  # Batch size to use
    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels

Note the following:
- TextLineDataset: The Dataset API will do a lot of memory management for you when you're using its file-based datasets. You can, for example, read in dataset files much larger than memory or read in multiple files by specifying a list as argument.
- shuffle: Reads buffer_size records, then shuffles (randomizes) their order.
- map: Calls the decode_csv function with each element in the dataset as an argument. Since we are using TextLineDataset, each element is a line of CSV text.
- decode_csv: Splits each line into fields, substituting the default values where a field is missing. It then returns a tuple of a dict (feature names mapped to field values) and the label. The map function replaces each element (line) in the dataset with that tuple.
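
The buffer-based shuffle can be approximated in plain Python. This is a simplified sketch of the general technique (fill a buffer, repeatedly emit a random buffered record and refill its slot), not TensorFlow's actual implementation. The function name `buffered_shuffle` and the `seed` parameter are assumptions made for illustration.

```python
import random


def buffered_shuffle(records, buffer_size, seed=None):
    """Yield records in a randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for record in records:
        if len(buffer) < buffer_size:
            buffer.append(record)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new record in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = record
    # Input exhausted: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(1000), buffer_size=256, seed=0))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1))
print(shuffled[:5])
```

A buffer smaller than the dataset only mixes nearby records, and a buffer of size 1 does not shuffle at all. If the training file holds 120 rows, as assumed here, a buffer of 256 covers the whole file and the shuffle is complete.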

In [8]:
next_batch = input_fn(FILE_TRAIN, True)  # Returns one shuffled batch of up to 32 examples

# Now let's try it out, retrieving and printing one batch of data.
# Although this code looks strange, you don't need to understand
# the details.
with tf.Session() as sess:
    first_batch = sess.run(next_batch)
print(first_batch)

({'SepalLength': array([5. , 5.1, 5.1, 5. , 5.2, 7. , 5.6, 6.7, 5.7, 5.8, 4.9, 5.1, 4.8,
       6.1, 5.6, 7.2, 6.6, 7.6, 6. , 5.2, 5.9, 5.2, 7.9, 7.4, 6.4, 5. ,
       5.1, 5. , 4.9, 5.7, 6.5, 5. ], dtype=float32), 'Sepal_Width': array([3.5, 3.8, 2.5, 3.5, 3.4, 3.2, 2.9, 3. , 2.8, 4. , 2.5, 3.8, 3.1,
       2.6, 2.7, 3.6, 3. , 3. , 2.9, 3.5, 3.2, 2.7, 3.8, 2.8, 3.2, 3.3,
       3.8, 2. , 2.4, 3. , 3. , 3.4], dtype=float32), 'PetalWidth': array([0.3, 0.2, 1.1, 0.6, 0.2, 1.4, 1.3, 2.3, 1.3, 0.2, 1.7, 0.3, 0.2,
       1.4, 1.3, 2.5, 1.4, 2.1, 1.5, 0.2, 1.8, 1.4, 2. , 1.9, 2.3, 0.2,
       0.4, 1. , 1. , 1.2, 1.8, 0.4], dtype=float32), 'PetalLength': array([1.3, 1.6, 3. , 1.6, 1.4, 4.7, 3.6, 5.2, 4.5, 1.2, 4.5, 1.5, 1.6,
       5.6, 4.2, 6.1, 4.4, 6.6, 4.5, 1.5, 4.8, 3.9, 6.4, 6.1, 5.3, 1.4,
       1.9, 3.5, 3.3, 4.2, 5.5, 1.6], dtype=float32)}, array([0, 0, 1, 0, 0, 1, 1, 2, 1, 0, 2, 0, 0, 2, 1, 2, 1, 2, 1, 0, 1, 1,
       2, 2, 2, 0, 0, 1, 1, 1, 2, 0], dtype=int32))


# Initialize and specify your model

In [9]:
# Create the feature_columns, which specifies the input to our model
# All our input features are numeric, so use numeric_column for each one
feature_columns = [tf.feature_column.numeric_column(k) for k in feature_names]

In [10]:
# In this case we are building a network with three hidden layers.
# Here we specify the number of hidden units in each layer.
num_hidden_units = [512, 256, 128]
number_classes = 3
directory = './Checkpoints/checkpoints_tutorial17-1/"'
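
To get a feel for the size of this network, the sketch below counts its trainable parameters. It assumes plain fully connected layers with one bias per unit, four numeric inputs, and three output logits. The helper name `dense_parameter_count` is hypothetical.

```python
def dense_parameter_count(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus one bias per output unit
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# 4 input features -> 512 -> 256 -> 128 hidden units -> 3 classes
layer_sizes = [4, 512, 256, 128, 3]
total_parameters = dense_parameter_count(layer_sizes)
print(total_parameters)  # 167171
```

Under those assumptions the model has far more parameters than the dataset has examples, so the layer sizes here are best read as a demonstration rather than a tuned choice.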

All Estimators take an input_fn that supplies them with input data. In our case, we reuse the input_fn defined above.

In [11]:
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=num_hidden_units,
    n_classes=number_classes,
    model_dir=directory)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x1037b6b50>, '_evaluation_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_model_dir': './Checkpoints/checkpoints_tutorial17-1/"', '_global_id_in_cluster': 0, '_save_summary_steps': 100}


In [12]:
classifier.train(
    input_fn=lambda: input_fn(FILE_TRAIN, True, 100))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into ./Checkpoints/checkpoints_tutorial17-1/"/model.ckpt.
INFO:tensorflow:loss = 34.649323, step = 1
INFO:tensorflow:global_step/sec: 283.743
INFO:tensorflow:loss = 7.8025436, step = 101 (0.355 sec)
INFO:tensorflow:global_step/sec: 315.448
INFO:tensorflow:loss = 0.5647123, step = 201 (0.316 sec)
INFO:tensorflow:global_step/sec: 356.741
INFO:tensorflow:loss = 0.28406665, step = 301 (0.280 sec)
INFO:tensorflow:Saving checkpoints for 375 into ./Checkpoints/checkpoints_tutorial17-1/"/model.ckpt.
INFO:tensorflow:Loss for final step: 0.24058434.


<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x11a461c10>
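
The training log stops at step 375. That number follows from the input pipeline: the dataset is repeated 100 times and then cut into batches of 32. A small sketch of the arithmetic, assuming the training file holds 120 examples (the helper name `steps_per_run` is hypothetical):

```python
import math


def steps_per_run(num_examples, repeat_count, batch_size):
    """Number of batches produced when the data is repeated, then batched."""
    return math.ceil(num_examples * repeat_count / batch_size)


train_steps = steps_per_run(num_examples=120, repeat_count=100, batch_size=32)
print(train_steps)  # 375
```

Because `repeat` is applied before `batch`, batches may span epoch boundaries, and only the very last batch can be smaller than 32.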

In [13]:
# Evaluate our model using the examples contained in FILE_TEST
# Return value will contain evaluation_metrics such as: loss & average_loss
evaluate_result = classifier.evaluate(
    input_fn=lambda: input_fn(FILE_TEST, False, 4))
print("Evaluation results")
for key in evaluate_result:
    print("   {}, was: {}".format(key, evaluate_result[key]))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-04-30-18:13:53
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./Checkpoints/checkpoints_tutorial17-1/"/model.ckpt-375
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-04-30-18:13:53
INFO:tensorflow:Saving dict for global step 375: accuracy = 0.96666664, average_loss = 0.053104945, global_step = 375, loss = 1.5931484
Evaluation results
   average_loss, was: 0.0531049445271
   accuracy, was: 0.966666638851
   global_step, was: 375
   loss, was: 1.59314835072
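
The two loss figures above are related. As a sketch, assume `average_loss` is the mean loss per example, `loss` is the per-batch summed loss averaged over batches, and the test file holds 30 examples (repeated 4 times, batched by 32). The reported numbers can then be reproduced as follows:

```python
def batch_sizes(total_examples, batch_size):
    """Sizes of the batches produced from total_examples records."""
    full, remainder = divmod(total_examples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


average_loss = 0.053104945          # value reported by evaluate()
total_examples = 30 * 4             # assumed 30 test rows, repeated 4 times
sizes = batch_sizes(total_examples, 32)

summed_loss = average_loss * total_examples  # loss summed over all examples
loss = summed_loss / len(sizes)              # average of the per-batch sums
print(sizes, loss)
```

Under these assumptions `loss` depends on the batch size, while `average_loss` does not, which makes `average_loss` the easier number to compare across runs.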


# Let's do a prediction

In [14]:
# Let's create a dataset for prediction
# We've taken the first 3 examples in FILE_TEST
prediction_input = [[5.9, 3.0, 4.2, 1.5],  # -> 1, Iris Versicolor
                    [6.9, 3.1, 5.4, 2.1],  # -> 2, Iris Virginica
                    [5.1, 3.3, 1.7, 0.5]]  # -> 0, Iris Setosa

In [15]:
def new_input_fn():
    def decode(x):
        x = tf.split(x, 4)  # Need to split into our 4 features
        return dict(zip(feature_names, x))  # To build a dict of them

    dataset = tf.data.Dataset.from_tensor_slices(prediction_input)
    dataset = dataset.map(decode)
    iterator = dataset.make_one_shot_iterator()
    next_feature_batch = iterator.get_next()
    return next_feature_batch, None  # In prediction, we have no labels

In [16]:
# Predict all our prediction_input
predict_results = classifier.predict(input_fn=new_input_fn)

In [17]:
# Print results
print("Predictions:")
for idx, prediction in enumerate(predict_results):
    class_id = prediction["class_ids"][0]  # Get the predicted class (index)
    if class_id == 0:
        print("  I think: {}, is Iris Setosa".format(prediction_input[idx]))
    elif class_id == 1:
        print("  I think: {}, is Iris Versicolor".format(prediction_input[idx]))
    else:
        print("  I think: {}, is Iris Virginica".format(prediction_input[idx]))

Predictions:
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from ./Checkpoints/checkpoints_tutorial17-1/"/model.ckpt-375
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
  I think: [5.9, 3.0, 4.2, 1.5], is Iris Versicolor
  I think: [6.9, 3.1, 5.4, 2.1], is Iris Virginica
  I think: [5.1, 3.3, 1.7, 0.5], is Iris Setosa
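
Each prediction yielded by the classifier is a dict that, besides `class_ids`, also carries a `probabilities` entry. The sketch below shows how a lookup table can replace the if/elif chain and how the class index relates to the probabilities. The `describe` helper and the probability values are made up for illustration and are not output from the trained model.

```python
SPECIES = ["Iris setosa", "Iris versicolor", "Iris virginica"]


def describe(prediction):
    """Return the most likely species name and its probability."""
    probabilities = list(prediction["probabilities"])
    # The predicted class is the index of the largest probability.
    class_id = max(range(len(probabilities)), key=probabilities.__getitem__)
    return SPECIES[class_id], probabilities[class_id]


# Hypothetical prediction dict with invented probabilities.
example_prediction = {"class_ids": [1], "probabilities": [0.02, 0.95, 0.03]}
name, confidence = describe(example_prediction)
print("{} ({:.0%})".format(name, confidence))
```

Reporting the probability alongside the label makes it easier to spot borderline cases, for example a flower that falls between Versicolor and Virginica.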
