# Input pipeline into Keras

In this notebook, we will look at how to use TensorFlow to read large datasets, including datasets that may not fit into memory. The tf.data pipeline can then feed that data to Keras models that use a TensorFlow backend.

Let's start off with the Python imports that we need.

In [2]:
import os, json, math
import numpy as np
import tensorflow as tf
print(tf.version.VERSION)

2.0.0-dev20190717


## Locating the CSV files

We will start with the CSV files that we wrote out in the [first notebook](../01_explore/taxifare.ipynb) of this sequence. So that you don't have to run that notebook, we saved a copy in ../data.

In [4]:
!ls -l ../data/*.csv

-rw-r--r-- 1 jupyter jupyter 123590 Jul 17 21:33 ../data/taxi-test.csv
-rw-r--r-- 1 jupyter jupyter 579055 Jul 17 21:33 ../data/taxi-train.csv
-rw-r--r-- 1 jupyter jupyter 123114 Jul 17 21:33 ../data/taxi-valid.csv


## Use tf.data to read the CSV files

See the documentation for [make_csv_dataset](https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_csv_dataset).
If you have TFRecords (which is recommended), use [make_batched_features_dataset](https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_batched_features_dataset) instead.

In [5]:
CSV_COLUMNS  = ['fare_amount',  'pickup_datetime',
                'pickup_longitude', 'pickup_latitude', 
                'dropoff_longitude', 'dropoff_latitude', 
                'passenger_count', 'key']
LABEL_COLUMN = 'fare_amount'
DEFAULTS     = [[0.0],['na'],[0.0],[0.0],[0.0],[0.0],[0.0],['na']]
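
In `make_csv_dataset`, each entry of `DEFAULTS` does two jobs: it fixes the column's type (a float default such as `[0.0]` gives a float column, a string default such as `['na']` gives a string column) and it supplies the value used when a field is empty. The plain-Python sketch below mimics that idea with the standard `csv` module. The helper `parse_row` and the sample line are hypothetical, and this is only an illustration of the concept, not how TensorFlow implements it.

```python
import csv
import io

CSV_COLUMNS = ['fare_amount', 'pickup_datetime',
               'pickup_longitude', 'pickup_latitude',
               'dropoff_longitude', 'dropoff_latitude',
               'passenger_count', 'key']
DEFAULTS = [[0.0], ['na'], [0.0], [0.0], [0.0], [0.0], [0.0], ['na']]

def parse_row(line):
    """Hypothetical helper: convert one CSV line into a column -> value dict."""
    fields = next(csv.reader(io.StringIO(line)))
    row = {}
    for name, default, text in zip(CSV_COLUMNS, DEFAULTS, fields):
        if text == '':
            row[name] = default[0]              # empty field: use the default
        else:
            row[name] = type(default[0])(text)  # cast to the default's type
    return row

# Made-up sample line with an empty passenger_count field.
sample = '6.1,2012-03-04 00:57:00 UTC,-73.99723,40.721912,-74.00726,40.708004,,2272'
row = parse_row(sample)
print(row['fare_amount'], row['passenger_count'], row['key'])
```

Here the empty `passenger_count` field falls back to `0.0`, while `key` stays a string because its default is a string.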

In [6]:
# load the training data
def load_dataset(pattern):
  return tf.data.experimental.make_csv_dataset(pattern, 1, CSV_COLUMNS, DEFAULTS)

tempds = load_dataset('../data/taxi-train*')
print(tempds)

W0717 21:33:38.545008 140098805683968 deprecation.py:323] From /home/jupyter/.local/lib/python3.5/site-packages/tensorflow_core/python/data/experimental/ops/readers.py:499: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
W0717 21:33:38.586160 140098805683968 deprecation.py:323] From /home/jupyter/.local/lib/python3.5/site-packages/tensorflow_core/python/data/experimental/ops/readers.py:212: shuffle_and_repeat (from tensorflow.python.data.experimental.ops.shuffle_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.shuffle(buffer_size, seed)` followed by `tf.data.Dataset.repeat(count)`. Static tf.data optimizati

<PrefetchDataset shapes: OrderedDict([(fare_amount, (1,)), (pickup_datetime, (1,)), (pickup_longitude, (1,)), (pickup_latitude, (1,)), (dropoff_longitude, (1,)), (dropoff_latitude, (1,)), (passenger_count, (1,)), (key, (1,))]), types: OrderedDict([(fare_amount, tf.float32), (pickup_datetime, tf.string), (pickup_longitude, tf.float32), (pickup_latitude, tf.float32), (dropoff_longitude, tf.float32), (dropoff_latitude, tf.float32), (passenger_count, tf.float32), (key, tf.string)])>


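Note that this is a prefetched dataset. Because the batch size is 1, looping over the dataset yields one row at a time. By default the dataset also repeats indefinitely, so the loop below breaks after a few rows. Let's convert each row into a Python dictionary: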

In [7]:
# print a few of the rows
for n, data in enumerate(tempds):
    row_data = {k: v.numpy() for k,v in data.items()}
    print(n, row_data)
    if n > 2:
        break

0 {'pickup_longitude': array([-73.99723], dtype=float32), 'dropoff_longitude': array([-74.00726], dtype=float32), 'dropoff_latitude': array([40.708004], dtype=float32), 'pickup_datetime': array([b'2012-03-04 00:57:00 UTC'], dtype=object), 'pickup_latitude': array([40.721912], dtype=float32), 'fare_amount': array([6.1], dtype=float32), 'key': array([b'2272'], dtype=object), 'passenger_count': array([2.], dtype=float32)}
1 {'pickup_longitude': array([-73.97512], dtype=float32), 'dropoff_longitude': array([-73.97276], dtype=float32), 'dropoff_latitude': array([40.761013], dtype=float32), 'pickup_datetime': array([b'2009-05-27 20:37:00 UTC'], dtype=object), 'pickup_latitude': array([40.752235], dtype=float32), 'fare_amount': array([5.3], dtype=float32), 'key': array([b'4498'], dtype=object), 'passenger_count': array([2.], dtype=float32)}
2 {'pickup_longitude': array([-73.97003], dtype=float32), 'dropoff_longitude': array([-73.99409], dtype=float32), 'dropoff_latitude': array([40.752304], d

What we really need is a dictionary of features plus a label. So, we have to do two things to the above dictionary: (1) remove the unwanted columns "pickup_datetime" and "key", and (2) keep the label separate from the features.

In [8]:
# get features, label
def features_and_labels(row_data):
    for unwanted_col in ['pickup_datetime', 'key']:
        row_data.pop(unwanted_col)
    label = row_data.pop(LABEL_COLUMN)
    return row_data, label  # features, label

# print a few rows to make sure it works
for n, data in enumerate(tempds):
    row_data = {k: v.numpy() for k,v in data.items()}
    features, label = features_and_labels(row_data)
    print(n, label, features)
    if n > 2:
        break

0 [6.1] {'pickup_longitude': array([-73.99723], dtype=float32), 'dropoff_longitude': array([-74.00726], dtype=float32), 'dropoff_latitude': array([40.708004], dtype=float32), 'pickup_latitude': array([40.721912], dtype=float32), 'passenger_count': array([2.], dtype=float32)}
1 [5.3] {'pickup_longitude': array([-73.97512], dtype=float32), 'dropoff_longitude': array([-73.97276], dtype=float32), 'dropoff_latitude': array([40.761013], dtype=float32), 'pickup_latitude': array([40.752235], dtype=float32), 'passenger_count': array([2.], dtype=float32)}
2 [15.7] {'pickup_longitude': array([-73.97003], dtype=float32), 'dropoff_longitude': array([-73.99409], dtype=float32), 'dropoff_latitude': array([40.752304], dtype=float32), 'pickup_latitude': array([40.799618], dtype=float32), 'passenger_count': array([5.], dtype=float32)}
3 [12.] {'pickup_longitude': array([-73.97913], dtype=float32), 'dropoff_longitude': array([-73.9691], dtype=float32), 'dropoff_latitude': array([40.79593], dtype=float32)

## Batching

Let's do both steps (loading the data and splitting it into features and label) in our load_dataset function, and also add batching.

In [9]:
def load_dataset(pattern, batch_size):
  return (
      tf.data.experimental.make_csv_dataset(pattern, batch_size, CSV_COLUMNS, DEFAULTS)
             .map(features_and_labels) # features, label
  )

# try changing the batch size and watch what happens.
tempds = load_dataset('../data/taxi-train*', batch_size=2)
print(list(tempds.take(3))) # truncate and print as a list 

[(OrderedDict([('pickup_longitude', <tf.Tensor: id=227, shape=(2,), dtype=float32, numpy=array([-73.98064 , -73.974396], dtype=float32)>), ('pickup_latitude', <tf.Tensor: id=226, shape=(2,), dtype=float32, numpy=array([40.730053, 40.752274], dtype=float32)>), ('dropoff_longitude', <tf.Tensor: id=224, shape=(2,), dtype=float32, numpy=array([-73.98362, -73.99073], dtype=float32)>), ('dropoff_latitude', <tf.Tensor: id=223, shape=(2,), dtype=float32, numpy=array([40.721794, 40.750866], dtype=float32)>), ('passenger_count', <tf.Tensor: id=225, shape=(2,), dtype=float32, numpy=array([1., 1.], dtype=float32)>)]), <tf.Tensor: id=228, shape=(2,), dtype=float32, numpy=array([5. , 8.5], dtype=float32)>), (OrderedDict([('pickup_longitude', <tf.Tensor: id=233, shape=(2,), dtype=float32, numpy=array([-73.99829, -73.9817 ], dtype=float32)>), ('pickup_latitude', <tf.Tensor: id=232, shape=(2,), dtype=float32, numpy=array([40.71353 , 40.778484], dtype=float32)>), ('dropoff_longitude', <tf.Tensor: id=230
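
With `batch_size=2`, each element is still a dictionary of columns, but every column now holds two values, which is why the tensors above report `shape=(2,)`. A plain-Python sketch of that grouping follows. The helper `batch_rows` and the rows are made up for illustration, and the sketch keeps a final partial batch; it shows the shape of the result rather than the tf.data implementation.

```python
def batch_rows(rows, batch_size):
    """Hypothetical helper: group row dicts into column-wise batches."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            # Turn a list of row dicts into one dict of per-column lists.
            yield {name: [r[name] for r in batch] for name in batch[0]}
            batch = []
    if batch:  # final partial batch when the row count is not a multiple
        yield {name: [r[name] for r in batch] for name in batch[0]}

# Made-up rows with two of the columns from this notebook.
rows = [
    {'passenger_count': 1.0, 'fare_amount': 5.0},
    {'passenger_count': 1.0, 'fare_amount': 8.5},
    {'passenger_count': 2.0, 'fare_amount': 6.1},
]
batches = list(batch_rows(rows, batch_size=2))
for b in batches:
    print(b)
```

Each batch is one dictionary whose values are lists of length `batch_size`, matching the layout that Keras expects for dictionary-style inputs.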

## Shuffling

When training a deep learning model in batches over multiple workers, it is helpful to shuffle the data. That way, different workers process different parts of the input at the same time, so the gradients averaged across workers come from a more varied sample of the data. Also, during training, we need to read the data indefinitely, so the training dataset is repeated.

In [10]:
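def load_dataset(pattern, batch_size=1, mode=tf.estimator.ModeKeys.EVAL):
  # By default make_csv_dataset shuffles and repeats forever; read the files
  # once, in order, so that cache() can complete and EVAL stays finite.
  dataset = (tf.data.experimental.make_csv_dataset(
                 pattern, batch_size, CSV_COLUMNS, DEFAULTS,
                 num_epochs=1, shuffle=False)
             .map(features_and_labels) # features, label
             .cache())
  if mode == tf.estimator.ModeKeys.TRAIN:
    dataset = dataset.shuffle(1000).repeat() # shuffle batches, read indefinitely
  # overlap input processing with training; AUTOTUNE lets tf.data pick the buffer size
  dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
  return dataset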

tempds = load_dataset('../data/taxi-train*', 2, tf.estimator.ModeKeys.TRAIN)
print(list(tempds.take(1)))
tempds = load_dataset('../data/taxi-valid*', 2, tf.estimator.ModeKeys.EVAL)
print(list(tempds.take(1)))

[(OrderedDict([('pickup_longitude', <tf.Tensor: id=341, shape=(2,), dtype=float32, numpy=array([-73.97186, -74.00559], dtype=float32)>), ('pickup_latitude', <tf.Tensor: id=340, shape=(2,), dtype=float32, numpy=array([40.750294, 40.74017 ], dtype=float32)>), ('dropoff_longitude', <tf.Tensor: id=338, shape=(2,), dtype=float32, numpy=array([-73.97463, -73.9615 ], dtype=float32)>), ('dropoff_latitude', <tf.Tensor: id=337, shape=(2,), dtype=float32, numpy=array([40.736443, 40.768517], dtype=float32)>), ('passenger_count', <tf.Tensor: id=339, shape=(2,), dtype=float32, numpy=array([2., 1.], dtype=float32)>)]), <tf.Tensor: id=342, shape=(2,), dtype=float32, numpy=array([ 8.5, 16.1], dtype=float32)>)]
[(OrderedDict([('pickup_longitude', <tf.Tensor: id=437, shape=(2,), dtype=float32, numpy=array([-74.01015 , -73.967316], dtype=float32)>), ('pickup_latitude', <tf.Tensor: id=436, shape=(2,), dtype=float32, numpy=array([40.711975, 40.76644 ], dtype=float32)>), ('dropoff_longitude', <tf.Tensor: id=
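
`shuffle(1000)` does not shuffle the whole dataset at once. It keeps a buffer of up to 1000 elements, draws each output at random from that buffer, and refills the freed slot from the input. The plain-Python sketch below illustrates buffer-based shuffling. The helper `buffered_shuffle` is hypothetical and assumes `buffer_size >= 1`; it mirrors the idea, not TensorFlow's actual code.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Hypothetical helper: yield items in a buffer-limited random order."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                # emit a random buffered element
        buffer[idx] = item               # refill its slot with the new item
    rng.shuffle(buffer)                  # input exhausted: drain the rest
    yield from buffer

data = list(range(10))
unshuffled = list(buffered_shuffle(data, buffer_size=1))
shuffled = list(buffered_shuffle(data, buffer_size=4, seed=0))
print(unshuffled)  # a buffer of 1 leaves the order unchanged
print(shuffled)
```

A small buffer only mixes nearby elements: with a buffer of 4, the element at output position `i` can come from no further than input position `i + 3`. This is why the buffer size should be chosen relative to the size of the dataset.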

In the next notebook, we will build the model using this input pipeline.

Copyright 2019 Google Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.