# How to Load CSV Data in TensorFlow 2.0

In [1]:
import functools
import numpy as np
import tensorflow as tf
import pandas as pd
import os

print(f"TensorFlow version: {tf.__version__}")
# Make numpy values easier to read.
np.set_printoptions(precision=3, suppress=True)

TensorFlow version: 2.2.0


## Get the Data
First, get the data. This uses the `tf.keras.utils.get_file` method from the `keras` API, which downloads the dataset only if it does not already exist on disk.

In [3]:
### LINKS TO DOWNLOAD DATA SETS ###
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

### PATH TO STORE DATA SETS (os.path.join KEEPS IT PORTABLE ACROSS OPERATING SYSTEMS) ###
path = os.path.join(os.getcwd(), "Datasets")

### CREATE THE Datasets DIR IF IT DOES NOT EXIST ###
os.makedirs(path, exist_ok=True)

### GET DATA SETS, ONLY IF DATA SET IS NOT DOWNLOADED ALREADY ###
train_file_path = tf.keras.utils.get_file(os.path.join(path, "train.csv"), TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file(os.path.join(path, "eval.csv"), TEST_DATA_URL)
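
The download-once behaviour described above can be sketched without TensorFlow. This is only an illustration of the idea, not how `tf.keras.utils.get_file` is implemented. The helper name `fetch_if_missing` and the `fake_download` callback are hypothetical, and no network access is involved.

```python
import os
import tempfile


def fetch_if_missing(path, download):
    """Call download(path) only when no file exists at path yet.

    Both names are hypothetical; `download` stands in for a real HTTP fetch.
    """
    if not os.path.exists(path):
        os.makedirs(os.path.dirname(path), exist_ok=True)
        download(path)
    return path


calls = []


def fake_download(path):
    # Stand-in for a network request: record the call and write a tiny CSV.
    calls.append(path)
    with open(path, "w") as f:
        f.write("survived,sex\n0,male\n")


target = os.path.join(tempfile.mkdtemp(), "Datasets", "train.csv")
fetch_if_missing(target, fake_download)  # file is missing, so it is "downloaded"
fetch_if_missing(target, fake_download)  # file exists now, so nothing happens
print(len(calls))  # prints 1
```

Building the path with `os.path.join` rather than a hard-coded separator keeps the sketch usable on Windows, Linux, and macOS alike.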

## Load the Data 
The only column you need to identify explicitly is the one with the value that the model is intended to predict.

In [37]:
LABEL_COLUMN = 'survived'
LABELS = [0, 1]

Now read the CSV data from the file and create a dataset.

(For the full documentation, see [tf.data.experimental.make_csv_dataset](https://www.tensorflow.org/api_docs/python/tf/data/experimental/make_csv_dataset))

In [67]:
def get_dataset(file_path, **kwargs):
    """ Transform a CSV file into a tf.data.Dataset:
        INPUTS:
        file_path: String: path to .csv file
        **kwargs:  Extra parameters forwarded to
                   tf.data.experimental.make_csv_dataset()
    """
    
    dataset = tf.data.experimental.make_csv_dataset(
        file_path, 
        batch_size=5, 
        label_name=LABEL_COLUMN, 
        na_value="?",
        num_epochs=1, 
        ignore_errors=True,  
        **kwargs)
    
    return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

### View the object ###
print(raw_train_data,"\n")
print(f"type: {type(raw_train_data)}")

<PrefetchDataset shapes: (OrderedDict([(sex, (None,)), (age, (None,)), (n_siblings_spouses, (None,)), (parch, (None,)), (fare, (None,)), (class, (None,)), (deck, (None,)), (embark_town, (None,)), (alone, (None,))]), (None,)), types: (OrderedDict([(sex, tf.string), (age, tf.float32), (n_siblings_spouses, tf.int32), (parch, tf.int32), (fare, tf.float32), (class, tf.string), (deck, tf.string), (embark_town, tf.string), (alone, tf.string)]), tf.int32)> 

type: <class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>


Each item in the dataset is a batch, represented as a tuple of (many examples, many labels). The data from the examples is organized in column-based tensors (rather than row-based tensors), each with as many elements as the batch size (5 in this case).
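
To make the column-based layout concrete, here is a plain-Python sketch that regroups a few CSV rows into one batch. It does not use `make_csv_dataset`. The helper name `rows_to_batch` and the three sample rows are made up for illustration.

```python
import csv
import io

# Three made-up rows in the same layout as the Titanic file (subset of columns).
SAMPLE_CSV = """survived,sex,age
0,male,22.0
1,female,38.0
1,female,26.0
"""


def rows_to_batch(rows, label_name):
    """Regroup row dicts into (features-by-column, labels)."""
    features = {}
    labels = []
    for row in rows:
        for key, value in row.items():
            if key == label_name:
                labels.append(int(value))
            else:
                features.setdefault(key, []).append(value)
    return features, labels


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
features, labels = rows_to_batch(rows, label_name="survived")

print(features["sex"])  # one entry per row in the batch
print(features["age"])  # still strings here; no dtype inference in this sketch
print(labels)
```

Each feature ends up as one sequence with as many entries as there are rows in the batch, which is the same shape of result that the dataset produces as tensors.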

It might help to see this yourself.

In [68]:
def show_batch(dataset):
    """ Show a batch of the Dataset
    INPUTS:
    dataset: a tf.data.Dataset that yields (features, label) batches
    """
    for batch, label in dataset.take(1):
        for key, value in batch.items():
            print(f"{key:20s}: {value.numpy()}")
            
show_batch(raw_train_data)

sex                 : [b'male' b'male' b'male' b'female' b'male']
age                 : [49. 28. 22. 49. 25.]
n_siblings_spouses  : [1 0 0 0 1]
parch               : [0 0 0 0 0]
fare                : [89.104  7.775  7.796 25.929 26.   ]
class               : [b'First' b'Third' b'Third' b'First' b'Second']
deck                : [b'C' b'unknown' b'unknown' b'D' b'unknown']
embark_town         : [b'Cherbourg' b'Southampton' b'Southampton' b'Southampton' b'Southampton']
alone               : [b'n' b'y' b'y' b'y' b'n']


As you can see, the columns in the CSV are named. The dataset constructor picks these names up automatically. If the file you are working with does not contain the column names in the first line, pass them as a list of strings to the `column_names` argument of the `make_csv_dataset` function.

In [69]:
### LIST OF NAMES FOR CSV COLUMNS ###
CSV_COLUMNS = ['survived', 'sex', 'age', 'n_siblings_spouses', 'parch', 'fare', 'class', 'deck', 'embark_town', 'alone']

### LOAD CSV TO TF ###
temp_dataset = get_dataset(train_file_path, column_names=CSV_COLUMNS)

### SHOW A BATCH OF 5 elements ###
show_batch(temp_dataset)

sex                 : [b'female' b'female' b'male' b'male' b'female']
age                 : [42. 30. 35. 28. 18.]
n_siblings_spouses  : [0 3 0 1 2]
parch               : [0 0 0 2 2]
fare                : [227.525  21.     10.5    23.45  262.375]
class               : [b'First' b'Second' b'Second' b'Third' b'First']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'B']
embark_town         : [b'Cherbourg' b'Southampton' b'Southampton' b'Southampton' b'Cherbourg']
alone               : [b'y' b'n' b'y' b'n' b'n']


This example is going to use all the available columns. If you need to omit some columns from the dataset, create a list of just the columns you plan to use, and pass it to the (optional) `select_columns` argument of the constructor.

In [70]:
SELECT_COLUMNS = ['survived', 'n_siblings_spouses', 'class', 'deck', 'alone']

temp_dataset = get_dataset(train_file_path, select_columns=SELECT_COLUMNS)

show_batch(temp_dataset)

n_siblings_spouses  : [0 0 0 0 0]
class               : [b'Third' b'Third' b'Third' b'Second' b'Third']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
alone               : [b'y' b'y' b'y' b'y' b'y']


## Data preprocessing

A CSV file can contain a variety of data types. Typically you want to convert from those mixed types to a fixed length vector before feeding the data into your model.

TensorFlow has a built-in system for describing common input conversions: `tf.feature_column`. See the TensorFlow feature columns tutorial for details.

You can preprocess your data using any tool you like (like nltk or sklearn), and just pass the processed output to TensorFlow.

The primary advantage of doing the preprocessing inside your model is that when you export the model it includes the preprocessing. This way you can pass the raw data directly to your model.

#### a) Continuous data 
If your data is already in an appropriate numeric format, you can pack the data into a vector before passing it off to the model:

In [71]:
SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'parch', 'fare']
DEFAULTS = [0, 0.0, 0.0, 0.0, 0.0]
temp_dataset = get_dataset(train_file_path, select_columns=SELECT_COLUMNS, column_defaults=DEFAULTS)

show_batch(temp_dataset)

age                 : [28. 25. 16. 17. 25.]
n_siblings_spouses  : [0. 1. 0. 4. 0.]
parch               : [0. 0. 0. 2. 0.]
fare                : [7.75  7.925 7.75  7.925 7.896]


In [72]:
example_batch, labels_batch = next(iter(temp_dataset))

In [49]:
def pack(features, label):
    return tf.stack(list(features.values()), axis=-1), label

packed_dataset = temp_dataset.map(pack)

for features, labels in packed_dataset.take(1):
  print(features.numpy())
  print()
  print(labels.numpy())


[[ 16.      0.      0.      7.75 ]
 [ 27.      0.      2.     11.133]
 [ 50.      2.      0.    133.65 ]
 [ 28.      1.      0.     19.967]
 [ 28.      0.      0.    110.883]]

[1 1 1 0 1]
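
The effect of stacking along the last axis can be checked with NumPy, whose `np.stack` accepts an `axis` argument in the same way as `tf.stack`. A minimal sketch, using values copied from the first three rows of the output above:

```python
import numpy as np

# Column-based batch: one array of length 3 per feature.
features = {
    "age": np.array([16.0, 27.0, 50.0]),
    "n_siblings_spouses": np.array([0.0, 0.0, 2.0]),
    "parch": np.array([0.0, 2.0, 0.0]),
    "fare": np.array([7.75, 11.133, 133.65]),
}

# axis=-1 puts the features side by side, so each row is one example.
packed = np.stack(list(features.values()), axis=-1)

print(packed.shape)  # (batch size, number of features)
print(packed[0])     # all four features of the first example
```

Stacking only works here because every column has the same length and a compatible numeric type, which is why the defaults were all set to numbers.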


If you have mixed datatypes you may want to separate out these simple numeric fields. The `tf.feature_column` API can handle them, but this incurs some overhead and should be avoided unless really necessary. Switch back to the mixed dataset:

In [73]:
show_batch(raw_train_data)

sex                 : [b'female' b'female' b'male' b'male' b'male']
age                 : [30. 15. 25. 28. 16.]
n_siblings_spouses  : [1 0 0 0 0]
parch               : [1 1 0 0 0]
fare                : [ 24.15  211.337  13.      7.896   9.5  ]
class               : [b'Third' b'First' b'Second' b'Third' b'Third']
deck                : [b'unknown' b'B' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton'
 b'Southampton']
alone               : [b'n' b'n' b'y' b'y' b'y']


In [74]:
example_batch, labels_batch = next(iter(temp_dataset))

So define a more general preprocessor that selects a list of numeric features and packs them into a single column:

In [75]:
class PackNumericFeatures(object):
  def __init__(self, names):
    self.names = names

  def __call__(self, features, labels):
    numeric_features = [features.pop(name) for name in self.names]
    numeric_features = [tf.cast(feat, tf.float32) for feat in numeric_features]
    numeric_features = tf.stack(numeric_features, axis=-1)
    features['numeric'] = numeric_features

    return features, labels

In [76]:
NUMERIC_FEATURES = ['age','n_siblings_spouses','parch', 'fare']

packed_train_data = raw_train_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))

packed_test_data = raw_test_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))

In [77]:
show_batch(packed_train_data)

sex                 : [b'male' b'male' b'female' b'male' b'male']
class               : [b'Third' b'Third' b'Third' b'Third' b'Second']
deck                : [b'unknown' b'unknown' b'unknown' b'unknown' b'unknown']
embark_town         : [b'Southampton' b'Southampton' b'Southampton' b'Southampton'
 b'Southampton']
alone               : [b'y' b'y' b'n' b'y' b'y']
numeric             : [[28.     0.     0.     7.775]
 [28.     0.     0.    14.5  ]
 [18.     1.     0.    17.8  ]
 [28.     0.     0.     8.05 ]
 [24.     0.     0.    13.   ]]


In [78]:
example_batch, labels_batch = next(iter(packed_train_data))

## Data Normalization

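Continuous data should generally be normalized so that features with very different ranges (here, for example, `parch` and `fare`) end up on a comparable scale.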

In [79]:
desc = pd.read_csv(train_file_path)[NUMERIC_FEATURES].describe()
desc

              age  n_siblings_spouses       parch        fare
count  627.000000          627.000000  627.000000  627.000000
mean    29.631308            0.545455    0.379585   34.385399
std     12.511818            1.151090    0.792999   54.597730
min      0.750000            0.000000    0.000000    0.000000
25%     23.000000            0.000000    0.000000    7.895800
50%     28.000000            0.000000    0.000000   15.045800
75%     35.000000            1.000000    0.000000   31.387500
max     80.000000            8.000000    5.000000  512.329200


In [81]:
MEAN = np.array(desc.T['mean'])
STD = np.array(desc.T['std'])
print(MEAN, STD)


[29.631  0.545  0.38  34.385] [12.512  1.151  0.793 54.598]


Now create a numeric column. The `tf.feature_column.numeric_column` API accepts a `normalizer_fn` argument, which will be run on each batch.

Bind the MEAN and STD to the normalizer fn using functools.partial.

In [84]:
def normalize_numeric_data(data, mean, std):
    # Center the data on the mean and scale it by the standard deviation.
    return (data - mean) / std

normalizer = functools.partial(normalize_numeric_data, mean=MEAN, std=STD)


numeric_column = tf.feature_column.numeric_column('numeric', normalizer_fn=normalizer, shape=[len(NUMERIC_FEATURES)])
numeric_columns = [numeric_column]
# See what you just created.
numeric_column

NumericColumn(key='numeric', shape=(4,), default_value=None, dtype=tf.float32, normalizer_fn=functools.partial(<function normalize_numeric_data at 0x000001F7F7C5B8B8>, mean=array([29.631,  0.545,  0.38 , 34.385]), std=array([12.512,  1.151,  0.793, 54.598])))

In [87]:
example_batch['numeric']

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[ 27.   ,   0.   ,   0.   ,  76.729],
       [ 17.   ,   0.   ,   2.   , 110.883],
       [ 15.   ,   1.   ,   0.   ,  14.454],
       [ 28.   ,   2.   ,   0.   ,  21.679],
       [  0.75 ,   2.   ,   1.   ,  19.258]], dtype=float32)>

In [88]:
numeric_layer = tf.keras.layers.DenseFeatures(numeric_columns)
numeric_layer(example_batch).numpy()

array([[-0.21 , -0.474, -0.479,  0.776],
       [-1.01 , -0.474,  2.043,  1.401],
       [-1.169,  0.395, -0.479, -0.365],
       [-0.13 ,  1.264, -0.479, -0.233],
       [-2.308,  1.264,  0.782, -0.277]], dtype=float32)
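
As a sanity check, the first row of the output above can be reproduced by hand with NumPy. This sketch uses the rounded `MEAN` and `STD` values printed earlier, so the result should agree only to about three decimals.

```python
import numpy as np

# Rounded statistics as printed for MEAN and STD.
mean = np.array([29.631, 0.545, 0.38, 34.385])
std = np.array([12.512, 1.151, 0.793, 54.598])

# First row of example_batch['numeric']: age, n_siblings_spouses, parch, fare.
first_row = np.array([27.0, 0.0, 0.0, 76.729])

# Same formula as normalize_numeric_data: subtract the mean, divide by the std.
normalized = (first_row - mean) / std
print(np.round(normalized, 3))
```

Broadcasting applies each column's own mean and standard deviation, so the age is scaled by the age statistics and the fare by the fare statistics.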

#### b) Categorical data
Some of the columns in the CSV data are categorical columns. That is, the content should be one of a limited set of options.

Use the tf.feature_column API to create a collection with a tf.feature_column.indicator_column for each categorical column.

In [91]:

CATEGORIES = {
    'sex': ['male', 'female'],
    'class' : ['First', 'Second', 'Third'],
    'deck' : ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'],
    'embark_town' : ['Cherbourg', 'Southampton', 'Queenstown'],
    'alone' : ['y', 'n']
}


categorical_columns = []
for feature, vocab in CATEGORIES.items():
  cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
  categorical_columns.append(tf.feature_column.indicator_column(cat_col))


# See what you just created.
categorical_columns

[IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='sex', vocabulary_list=('male', 'female'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='class', vocabulary_list=('First', 'Second', 'Third'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='deck', vocabulary_list=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='embark_town', vocabulary_list=('Cherbourg', 'Southampton', 'Queenstown'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
 IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='alone', vocabulary_list=('y', 'n'), dtype=tf.string, default_value=-1, num_oov_buckets=0))]
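
An indicator column is essentially a one-hot encoding over the vocabulary. The plain-NumPy sketch below shows the idea; it is not the TensorFlow implementation, and the helper name `one_hot` and the value `'Steerage'` are made up for illustration. It assumes the defaults shown in the output above (`default_value=-1`, `num_oov_buckets=0`), under which a string missing from the vocabulary should come out as an all-zero row.

```python
import numpy as np


def one_hot(values, vocabulary):
    """One-hot encode strings; values outside the vocabulary become all zeros."""
    index = {word: i for i, word in enumerate(vocabulary)}
    encoded = np.zeros((len(values), len(vocabulary)), dtype=np.float32)
    for row, value in enumerate(values):
        if value in index:
            encoded[row, index[value]] = 1.0
    return encoded


# 'Steerage' is deliberately not in the vocabulary.
encoded = one_hot(["First", "Third", "Steerage"], ["First", "Second", "Third"])
print(encoded)
```

Because unmatched strings carry no signal at all in this scheme, the spelling of every vocabulary entry needs to match the values in the CSV exactly.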

A next step would be to build a `tf.keras.Sequential` model that starts with a preprocessing layer combining the numeric and categorical columns defined above.