In [1]:
import tensorflow as tf

When the data used to train a model fits in memory, we can create an input pipeline by constructing a dataset with **tf.data.Dataset.from_tensors** or **tf.data.Dataset.from_tensor_slices**.

In [2]:
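def create_dataset(X, Y, epochs, batch_size):
    # One dataset element per (row of X, matching entry of Y)
    dataset = tf.data.Dataset.from_tensor_slices((X, Y))
    # Repeat the data `epochs` times, then group it into full batches only
    dataset = dataset.repeat(epochs).batch(batch_size, drop_remainder=True)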
    
    return dataset
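
To make the effect of `repeat` and `drop_remainder` concrete, here is a minimal sketch of the same pipeline on toy data. It assumes TensorFlow 2.x with eager execution, and the arrays `X` and `Y` are hypothetical values invented for illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 10 examples with 2 features each.
X = np.arange(20, dtype=np.float32).reshape(10, 2)
Y = np.arange(10, dtype=np.int32)

epochs, batch_size = 3, 4

dataset = tf.data.Dataset.from_tensor_slices((X, Y))
dataset = dataset.repeat(epochs).batch(batch_size, drop_remainder=True)

# Collect the shape of every batch (requires eager execution).
batch_shapes = [(x.numpy().shape, y.numpy().shape) for x, y in dataset]
num_batches = len(batch_shapes)

print(num_batches)      # 10 examples * 3 epochs = 30 examples -> 7 full batches of 4
print(batch_shapes[0])  # ((4, 2), (4,))
```

Because `repeat` comes before `batch`, a batch can straddle two epochs, and only the final incomplete batch (2 examples here) is dropped.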

What is the difference between the two functions?  
**from_tensors**: Treats the whole input as one unit and returns a dataset with a single element  
**from_tensor_slices**: Slices the input along its first dimension and creates a dataset with a separate element for each row  
Let's take a look.

In [7]:
# Tensor that is going to be used to create the datasets
t = tf.constant([[4, 2], [5, 3]])

ds_tensors = tf.data.Dataset.from_tensors(t)
print('Dataset created with from_tensors: \n', str(ds_tensors), '\n')

ds_tensor_slices = tf.data.Dataset.from_tensor_slices(t)
print('Dataset created with from_tensor_slices: \n', str(ds_tensor_slices))

Dataset created with from_tensors: 
 <TensorDataset shapes: (2, 2), types: tf.int32> 

Dataset created with from_tensor_slices: 
 <TensorSliceDataset shapes: (2,), types: tf.int32>
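
The shapes printed above describe each element, not how many elements the dataset holds. As a minimal sketch (assuming TensorFlow 2.x with eager execution), iterating over both datasets makes the element counts visible:

```python
import tensorflow as tf

t = tf.constant([[4, 2], [5, 3]])

# from_tensors: the whole tensor becomes one element.
tensors_elements = [e.numpy().tolist() for e in tf.data.Dataset.from_tensors(t)]

# from_tensor_slices: each row becomes its own element.
slices_elements = [e.numpy().tolist() for e in tf.data.Dataset.from_tensor_slices(t)]

print(len(tensors_elements), tensors_elements)  # 1 [[[4, 2], [5, 3]]]
print(len(slices_elements), slices_elements)    # 2 [[4, 2], [5, 3]]
```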


What if the data does not fit in memory? If we need to read it from a CSV file, for example, we can use **TextLineDataset**, which yields the file one line at a time.

In [8]:
def parse_row(records):
    # tf.io.decode_csv is the current name of this op (tf.decode_csv in older TF 1.x releases)
    cols = tf.io.decode_csv(records, record_defaults=[[0], ['house'], [0]])
    features = {'sq_footage': cols[0], 'type': cols[1]}
    label = cols[2]
    
    return features, label
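
As a quick check of what the parsing step produces, the sketch below decodes a single line by hand. It assumes TensorFlow 2.x with eager execution; the input strings are hypothetical, and the second call illustrates that empty fields fall back to the values in `record_defaults`.

```python
import tensorflow as tf

record_defaults = [[0], ['house'], [0]]

# Decode one CSV line into one tensor per column.
cols = tf.io.decode_csv(tf.constant('1001,house,501'), record_defaults=record_defaults)
features = {'sq_footage': cols[0], 'type': cols[1]}
label = cols[2]

sq_footage = int(features['sq_footage'].numpy())
property_type = features['type'].numpy().decode('utf-8')
price = int(label.numpy())
print(sq_footage, property_type, price)  # 1001 house 501

# Empty fields are filled with the corresponding default.
default_cols = tf.io.decode_csv(tf.constant(',,'), record_defaults=record_defaults)
defaults = [int(default_cols[0].numpy()),
            default_cols[1].numpy().decode('utf-8'),
            int(default_cols[2].numpy())]
print(defaults)  # [0, 'house', 0]
```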

In [9]:
def create_dataset(csv_file_path):
    dataset = tf.data.TextLineDataset(csv_file_path)
    dataset = dataset.map(parse_row)
    dataset = dataset.shuffle(1000).repeat(15).batch(128)
    
    return dataset

Imagine we apply the functions above to the following data from a CSV file.  

|sq_footage | property_type | price |
| --- | --- | --- |
|1001 | house | 501 |
|2001 | house | 1001 |
|3001 | house | 1501 |
|1001 | apt | 701 |
|2001 | apt | 1301 |
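
To see the pipeline run end to end on this data, here is a minimal sketch that writes the rows to a temporary file and reads them back. It assumes TensorFlow 2.x with eager execution and a file with no header row (a header line would fail to parse as an integer). The file name `properties.csv` is hypothetical, and shuffling and repeating are left out so the output is deterministic.

```python
import os
import tempfile

import tensorflow as tf

# The table rows as CSV lines, without a header (an assumption of this sketch).
rows = ['1001,house,501', '2001,house,1001', '3001,house,1501',
        '1001,apt,701', '2001,apt,1301']


def parse_row(record):
    cols = tf.io.decode_csv(record, record_defaults=[[0], ['house'], [0]])
    return {'sq_footage': cols[0], 'type': cols[1]}, cols[2]


with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'properties.csv')  # hypothetical file name
    with open(csv_path, 'w') as f:
        f.write('\n'.join(rows) + '\n')

    dataset = tf.data.TextLineDataset(csv_path).map(parse_row).batch(2)

    # Materialize the batches while the temporary file still exists.
    batches = [(feats['sq_footage'].numpy().tolist(), label.numpy().tolist())
               for feats, label in dataset]

print(batches)
# [([1001, 2001], [501, 1001]), ([3001, 1001], [1501, 701]), ([2001], [1301])]
```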

Similarly, we can read a set of sharded CSV files by listing the files and applying **TextLineDataset** to each of them.

In [15]:
def create_dataset(path):
    # list_files: get all file names matching the path pattern
    # flat_map:   join the lines of all CSV files into a single dataset (one-to-many transformation)
    # map:        parse each row (one-to-one transformation)
    dataset = (tf.data.Dataset.list_files(path)
                              .flat_map(tf.data.TextLineDataset)
                              .map(parse_row))

    dataset = dataset.shuffle(1000).repeat(15).batch(128)
    
    return dataset
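
The sketch below exercises the `list_files` / `flat_map` / `map` chain on two small shards. It assumes TensorFlow 2.x with eager execution; the shard names `part-0.csv` and `part-1.csv` are hypothetical. Since `list_files` shuffles file names by default, the result is sorted before printing.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical shards: the example rows split across two files.
shards = {
    'part-0.csv': ['1001,house,501', '2001,house,1001', '3001,house,1501'],
    'part-1.csv': ['1001,apt,701', '2001,apt,1301'],
}


def parse_row(record):
    cols = tf.io.decode_csv(record, record_defaults=[[0], ['house'], [0]])
    return {'sq_footage': cols[0], 'type': cols[1]}, cols[2]


with tempfile.TemporaryDirectory() as tmp_dir:
    for name, lines in shards.items():
        with open(os.path.join(tmp_dir, name), 'w') as f:
            f.write('\n'.join(lines) + '\n')

    pattern = os.path.join(tmp_dir, 'part-*.csv')
    dataset = (tf.data.Dataset.list_files(pattern)   # one element per file name
               .flat_map(tf.data.TextLineDataset)    # one element per line, across files
               .map(parse_row))                      # one parsed row per line

    # Read everything while the temporary files still exist.
    labels = sorted(int(label.numpy()) for _, label in dataset)

print(labels)  # [501, 701, 1001, 1301, 1501]
```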

Feature columns bridge the gap between the columns in a CSV file and the features used to train a model. Models train on numbers, so non-numeric columns such as the property type have to be converted first.

In [17]:
featcols = [
    tf.feature_column.numeric_column('sq_footage'),
    tf.feature_column.categorical_column_with_vocabulary_list('type',
                                                              ['house', 'apt']) # Possible values in []
]

We use the feature column API to define the features.  
**numeric_column** for numerical columns  
**categorical_column_with_vocabulary_list** for the property type. Use this when your inputs are strings or integers and you have an in-memory vocabulary that maps each value to an integer ID. By default, out-of-vocabulary values are ignored.  
  
Under the hood, feature columns take care of packing the inputs into the input vector of the model.
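
As a purely conceptual illustration of that packing, the plain-Python sketch below builds an input vector from one row. It is not TensorFlow's implementation: the helper `pack_features` is hypothetical, and it assumes the categorical column is one-hot encoded (in TensorFlow this roughly corresponds to wrapping it in `tf.feature_column.indicator_column`), with out-of-vocabulary values mapped to all zeros.

```python
# Vocabulary from the feature column definition above.
VOCABULARY = ['house', 'apt']


def pack_features(sq_footage, property_type):
    """Hypothetical helper: pack one row into a flat numeric vector."""
    # One-hot encode the category; unknown values yield all zeros.
    one_hot = [1.0 if property_type == value else 0.0 for value in VOCABULARY]
    # The numeric column passes through unchanged, followed by the one-hot slots.
    return [float(sq_footage)] + one_hot


print(pack_features(1001, 'house'))  # [1001.0, 1.0, 0.0]
print(pack_features(2001, 'apt'))    # [2001.0, 0.0, 1.0]
print(pack_features(1500, 'condo'))  # [1500.0, 0.0, 0.0]
```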