# Introduction to TensorFlow Datasets and Estimators

Datasets and Estimators are two key TensorFlow features:

- Datasets: The recommended way to create input pipelines, that is, to read data into your graph.
- Estimators: A high-level API for creating TensorFlow models. Estimators include canned (pre-made) models and custom estimators.

Below is the TensorFlow architecture, including the Dataset API and Estimators. Combined, they offer an easy way to create TensorFlow models:

![title](https://3.bp.blogspot.com/-l2UT45WGdyw/Wbe7au1nfwI/AAAAAAAAD1I/GeQcQUUWezIiaFFRCiMILlX2EYdG49C0wCLcBGAs/s1600/image6.png)


# Our Data

The trained model categorizes Iris flowers based on four botanical features (sepal length, sepal width, petal length, and petal width). So, during inference, you can provide values for those four features and the model will predict that the flower is one of the following three beautiful variants:

![title](https://www.tensorflow.org/images/iris_three_species.jpg)

In [34]:
import os

import six.moves.urllib.request as request
import tensorflow as tf
import pandas as pd

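# Record and print the installed TensorFlow version. This notebook uses
# TF 1.x APIs such as tf.logging and tf.Session.
tf_version = tf.__version__
print("TensorFlow version: {}".format(tf_version))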

In [35]:
tf.logging.set_verbosity(tf.logging.INFO)

# The Model

We're going to train a Deep Neural Network Classifier with the below structure. All input and output values will be float32, and the sum of the output values will be 1 (as we are predicting the probability for each individual Iris type):

![title](https://1.bp.blogspot.com/-EEdRK1mK1QQ/Wbe7qPWECZI/AAAAAAAAD1Q/fjnpGIiRIosTZ3YupkgiKJVaBtPg8KvGwCLcBGAs/s1600/image3.jpg)
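
To see why the output values sum to 1, here is a minimal pure-Python sketch of a softmax, the function a multi-class classifier of this kind typically applies to its final layer. The `logits` values are made up purely for illustration and do not come from the trained model.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability
    # without changing the result.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the three Iris classes (illustrative only).
logits = [2.0, 1.0, 0.1]
probabilities = softmax(logits)
print(probabilities)
print(sum(probabilities))
```

The largest raw score always receives the largest probability, which is why the predicted class is simply the index of the largest output.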

# Let's get our Data

First we need to fetch our data. It can be downloaded as CSV files from the TensorFlow website.

In [36]:
# Path to where data will be stored
PATH = "/tmp/tf_dataset_and_estimator_apis"

In [37]:
# Fetch and store Training and Test dataset files
PATH_DATASET = PATH + os.sep + "dataset"
FILE_TRAIN = PATH_DATASET + os.sep + "iris_training.csv"
FILE_TEST = PATH_DATASET + os.sep + "iris_test.csv"
URL_TRAIN = "http://download.tensorflow.org/data/iris_training.csv"
URL_TEST = "http://download.tensorflow.org/data/iris_test.csv"

In [38]:
# This function will fetch (download) the data.

def downloadDataset(url, file):
    # Create the target directory if it does not exist yet
    if not os.path.exists(PATH_DATASET):
        os.makedirs(PATH_DATASET)
    # Download only if the file is not already stored locally
    if not os.path.exists(file):
        data = request.urlopen(url).read()
        with open(file, "wb") as f:
            f.write(data)  # The with block closes the file automatically

downloadDataset(URL_TRAIN, FILE_TRAIN)
downloadDataset(URL_TEST, FILE_TEST)

# Specify Metadata

To describe our dataset, we first create a list of our features.

**Exercise 1:**

- *Create a list that stores all the input feature names. Tip: Have a look at the image of the model*

In [None]:
# List of the features. 

feature_names =  # Put your code here. 

# Create input function

When we train our model, we'll need a function that reads the input file and returns the feature and label data. Estimators require an input function of the following format. The return value must be a two-element tuple organized as follows:

- The first element must be a dict in which each input feature name is a key, mapped to a list of values for the training batch.
- The second element is a list of labels for the training batch.

Since we are returning a batch of input features and training labels, all lists in the return statement will have equal lengths. Technically speaking, whenever we refer to a "list" here, we actually mean a 1-d TensorFlow tensor.
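
The return format described above can be sketched in plain Python, with ordinary lists standing in for tensors. The feature names `f0` to `f3` and the header line are placeholders chosen for this sketch (the real names are the subject of Exercise 1). The three data rows are the examples used in the prediction section of this notebook.

```python
import csv
import io

# Placeholder feature names, used for this sketch only.
sketch_feature_names = ["f0", "f1", "f2", "f3"]

# A tiny in-memory CSV: one placeholder header line followed by data rows.
csv_text = """header,line,to,be,skipped
5.9,3.0,4.2,1.5,1
6.9,3.1,5.4,2.1,2
5.1,3.3,1.7,0.5,0
"""

def read_batch(text, names):
    """Return (features, labels) in the shape an input function must produce."""
    rows = list(csv.reader(io.StringIO(text)))[1:]  # skip the header row
    features = {name: [] for name in names}
    labels = []
    for row in rows:
        *values, label = row  # the last column is the label
        for name, value in zip(names, values):
            features[name].append(float(value))
        labels.append(int(label))
    return features, labels

features, labels = read_batch(csv_text, sketch_feature_names)
print(features)
print(labels)
```

Every list in `features` has the same length as `labels`, which is the equal-length property mentioned above.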

In [None]:
def input_fn(file_path, perform_shuffle=False, repeat_count=1):
    def decode_csv(line):
        # Convert CSV records to tensors. Each column maps to one tensor.
        parsed_line = tf.decode_csv(line, [[0.], [0.], [0.], [0.], [0]])
        label = parsed_line[-1]  # Last element is the label
        del parsed_line[-1]  # Delete last element (label)
        features = parsed_line  # Everything but last elements are the features
        d = dict(zip(feature_names, features)), label  # (features dict, label) tuple
        return d
    
    # A Dataset comprising lines from one or more text files
    dataset = (tf.data.TextLineDataset(file_path)  # Read text file
               .skip(1)  # Skip header row
               .map(decode_csv))  # Transform each elem by applying decode_csv fn
    
    if perform_shuffle:
        # Randomizes input using a window of 256 elements (read into memory)
        dataset = dataset.shuffle(buffer_size=256)
    dataset = dataset.repeat(repeat_count)  # Repeats dataset this # times
    dataset = dataset.batch(32)  # Batch size to use
    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels

Note the following:
- TextLineDataset: The Dataset API will do a lot of memory management for you when you're using its file-based datasets. You can, for example, read in dataset files much larger than memory or read in multiple files by specifying a list as argument.
- shuffle: Fills a buffer with buffer_size records and draws elements from it at random, so the order is randomized within that window.
- map: Calls the decode_csv function with each element in the dataset as an argument. Since we are using TextLineDataset, each element is a line of CSV text.
- decode_csv: Splits each line into fields, providing the default values if necessary, then returns a tuple of a dict (feature names mapped to field values) and the label. The map function replaces each element (line) in the dataset with that tuple.
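
To build some intuition for what buffer_size does, here is a rough pure-Python model of buffered shuffling. It is a sketch of the idea only, not TensorFlow's actual implementation, and the function name `buffered_shuffle` is made up for this example.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally randomized order (buffer_size must be >= 1)."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and reuse its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    for item in buffer:
        yield item

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
in_order = list(buffered_shuffle(range(10), buffer_size=1, seed=0))
print(shuffled)
print(in_order)
```

With a buffer of one element the order is unchanged, and with a small buffer an element can only move a limited distance forward. This suggests why a buffer at least as large as the dataset is needed for a uniform shuffle.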

In [None]:
next_batch = input_fn(FILE_TRAIN, True)  # Tensors that yield a shuffled batch of up to 32 examples

# Now let's try it out, retrieving and printing one batch of data.
# Although this code looks strange, you don't need to understand
# the details.

with tf.Session() as sess:
    first_batch = sess.run(next_batch)
print(first_batch)

# Initialize and specify your Estimator

The Estimators API is a high-level API that removes much of the boilerplate code you previously needed to write when training a TensorFlow model. Estimators are also very flexible, allowing you to override the default behavior if you have specific requirements for your model.

There are two possible ways you can build your model using Estimators:

- Pre-made Estimator: These are predefined estimators, created to generate a specific type of model. In this blog post, we will use the DNNClassifier pre-made estimator.
- Estimator (base class): Gives you complete control of how your model should be created by using a model_fn function. We will cover how to do this in a separate blog post.

Here is the class diagram for Estimators:

![title](https://1.bp.blogspot.com/-njTtnjOq_cE/Wbe772URrgI/AAAAAAAAD1Y/h1mWj6MGSzYg_KDuVXWBYeNqA4z5WRSpACLcBGAs/s1600/image2.jpg)

As you can see, all estimators make use of an input_fn that provides the estimator with input data. In our case, we will reuse the input_fn we defined above for this purpose.

In [None]:
# Create the feature_columns, which specifies the input to our model
# In this case our input features are numeric, so use numeric_column for each one

feature_columns = [tf.feature_column.numeric_column(k) for k in feature_names]

In [None]:
# Here we specify some of the hyperparameters and metadata.

num_hidden_units = [512, 256, 128]  # Number of neurons in each hidden layer
directory = './Checkpoints/checkpoints_tutorial17-1/'  # Where model checkpoints are saved

In [None]:
# Here we instantiate the estimator. In this case we are using a DNN Classifier

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns, 
    hidden_units=num_hidden_units,
    n_classes=3,
    model_dir=directory)

In [None]:
# Now we have everything in place and we can start the training. 

classifier.train(
    input_fn=lambda: input_fn(FILE_TRAIN, True, 100))

But wait a minute... what is this "lambda: input_fn(FILE_TRAIN, True, 100)" stuff? That is where we hook up Datasets with the Estimators! Estimators need data to perform training, evaluation, and prediction, and they use the input_fn to fetch the data. Estimators require an input_fn with no arguments, so we create a function with no arguments using lambda, which calls our input_fn with the desired arguments: the file_path, shuffle setting, and repeat_count. In our case, we call our input_fn, passing it:

- FILE_TRAIN, which is the training data file.
- True, which tells input_fn to shuffle the data.
- 100, which tells input_fn to repeat the dataset 100 times.
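
The zero-argument wrapping can be tried out in isolation. In the sketch below, `sketch_input_fn`, `call_without_arguments`, and the file name `train.csv` are hypothetical stand-ins. No TensorFlow is involved; the caller simply mimics code that, like an Estimator, invokes the function it is given without arguments. `functools.partial` is shown as an alternative to lambda.

```python
import functools

def sketch_input_fn(file_path, perform_shuffle=False, repeat_count=1):
    """Stand-in that only reports the arguments it received."""
    return {"file_path": file_path,
            "perform_shuffle": perform_shuffle,
            "repeat_count": repeat_count}

def call_without_arguments(fn):
    """Mimics a caller that invokes fn with no arguments."""
    return fn()

# Option 1: a lambda that captures the desired arguments.
via_lambda = call_without_arguments(
    lambda: sketch_input_fn("train.csv", True, 100))

# Option 2: functools.partial pre-binds the same arguments.
via_partial = call_without_arguments(
    functools.partial(sketch_input_fn, "train.csv", True, 100))

print(via_lambda)
print(via_partial == via_lambda)
```

Both forms defer the call until the caller is ready, which is what lets the input pipeline be built at the moment the caller chooses.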

In [None]:
# Evaluate our model using the examples contained in FILE_TEST
# Return value will contain evaluation_metrics such as: loss & average_loss
evaluate_result = classifier.evaluate(
    input_fn=lambda: input_fn(FILE_TEST, False, 4))
print("Evaluation results")
for key in evaluate_result:
    print("   {}, was: {}".format(key, evaluate_result[key]))

**Exercise 2:**

- *Change some of the hyperparameters of the model to see if you can improve the performance.*
- *What is the activation function of the DNN Classifier? How can you change it?*
- *Tweak and change more of the code and see how things work*

**Exercise 3:**

- *Test a different model: Maybe Linear Classifier?*

In [40]:
# Write your new model here

# Let's do a prediction

And that's it! We now have a trained model, and if we are happy with the evaluation results, we can use it to predict an Iris flower based on some input. As with training and evaluation, we make predictions using a single function call:

In [None]:
# Let's create a dataset for prediction
# We've taken the first 3 examples in FILE_TEST
prediction_input = [[5.9, 3.0, 4.2, 1.5],  # -> 1, Iris Versicolor
                    [6.9, 3.1, 5.4, 2.1],  # -> 2, Iris Virginica
                    [5.1, 3.3, 1.7, 0.5]]  # -> 0, Iris Setosa

In [None]:
def new_input_fn():
    def decode(x):
        x = tf.split(x, 4)  # Need to split into our 4 features
        return dict(zip(feature_names, x))  # To build a dict of them

    dataset = tf.data.Dataset.from_tensor_slices(prediction_input)
    dataset = dataset.map(decode)
    iterator = dataset.make_one_shot_iterator()
    next_feature_batch = iterator.get_next()
    return next_feature_batch, None  # In prediction, we have no labels

In [None]:
# Predict all our prediction_input
predict_results = classifier.predict(input_fn=new_input_fn)

In [None]:
# Print results
print("Predictions:")
for idx, prediction in enumerate(predict_results):
    class_id = prediction["class_ids"][0]  # Get the predicted class (index)
    if class_id == 0:
        print("  I think: {}, is Iris Setosa".format(prediction_input[idx]))
    elif class_id == 1:
        print("  I think: {}, is Iris Versicolor".format(prediction_input[idx]))
    else:
        print("  I think: {}, is Iris Virginica".format(prediction_input[idx]))
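
As an alternative to the if/elif chain, the class index can be used to look up the species name in a list. The sketch below uses hand-written dictionaries as stand-ins for the values yielded by `classifier.predict`. Only the `class_ids` key used in the cell above is assumed, and the class indices are the expected labels noted next to prediction_input, not real model output.

```python
# Species names ordered by class index (0, 1, 2).
SPECIES = ["Iris Setosa", "Iris Versicolor", "Iris Virginica"]

sketch_inputs = [[5.9, 3.0, 4.2, 1.5],
                 [6.9, 3.1, 5.4, 2.1],
                 [5.1, 3.3, 1.7, 0.5]]

# Hand-written stand-ins for predictions (expected labels, not model output).
sketch_predictions = [{"class_ids": [1]},
                      {"class_ids": [2]},
                      {"class_ids": [0]}]

def describe(inputs, predictions):
    """Build one message per example by indexing into SPECIES."""
    lines = []
    for features, prediction in zip(inputs, predictions):
        class_id = prediction["class_ids"][0]
        lines.append("  I think: {}, is {}".format(features, SPECIES[class_id]))
    return lines

described = describe(sketch_inputs, sketch_predictions)
for line in described:
    print(line)
```

A lookup list keeps the printing code unchanged if the set of classes grows.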

# Freebies

**Exercise 4:**

- *Check out TensorBoard and see if you can get it working*
- *Check out your performance metrics and the graph*

In [None]:
# Write your code here

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License