<a href="https://colab.research.google.com/github/pA1nD/course-deep-learning/blob/master/L3_Data_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data input pipelines

This notebook shows how to load data from several formats (CSV files, NumPy arrays, and pandas DataFrames) into a `tf.data.Dataset`, and how to batch, repeat, and shuffle it.

For more background, see the guide [Build TensorFlow input pipelines](https://www.tensorflow.org/guide/data).

In [0]:
%tensorflow_version 2.x
import tensorflow as tf

import pandas as pd
import numpy as np

# Loading Data

There are more examples available at [https://www.tensorflow.org/tutorials/load_data/csv](https://www.tensorflow.org/tutorials/load_data/csv)

## From CSV

In [0]:
TRAIN_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/train.csv"
TEST_DATA_URL = "https://storage.googleapis.com/tf-datasets/titanic/eval.csv"

train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
test_file_path = tf.keras.utils.get_file("eval.csv", TEST_DATA_URL)

# You could also load these files with pandas, as shown further down.

Print the first few lines from the csv file.

In [0]:
!head {train_file_path}

Load the data via `tf.data.experimental.make_csv_dataset()`

In [0]:
def get_dataset(file_path, **kwargs):
  dataset = tf.data.experimental.make_csv_dataset(
      file_path,
      batch_size=5,  # Artificially small to make examples easier to show.
      # The label is the only column you must name explicitly: it holds the
      # value the model is intended to predict.
      label_name='survived',
      na_value="?",
      num_epochs=1,  # Without this, the dataset repeats indefinitely.
      ignore_errors=True,
      **kwargs)
  return dataset

raw_train_data = get_dataset(train_file_path)
raw_test_data = get_dataset(test_file_path)

In [0]:
def show_batch(dataset):
  for batch, label in dataset.take(1):
    for key, value in batch.items():
      print("{:20s}: {}".format(key,value.numpy()))
show_batch(raw_train_data)
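
To make the shape of the result easier to see, here is a minimal sketch that runs `make_csv_dataset` on a tiny hand-written CSV. The file contents, the temporary path, and the names `tiny_csv_path` and `tiny_dataset` are made up for illustration; `shuffle=False` is only there to keep the rows in file order.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical three-row CSV, written to a temporary file for illustration.
csv_text = "survived,age,class\n0,22.0,Third\n1,38.0,First\n1,26.0,Third\n"
tiny_csv_path = os.path.join(tempfile.mkdtemp(), "tiny.csv")
with open(tiny_csv_path, "w") as f:
    f.write(csv_text)

tiny_dataset = tf.data.experimental.make_csv_dataset(
    tiny_csv_path,
    batch_size=2,
    label_name="survived",
    num_epochs=1,
    shuffle=False,  # keep file order so the batches are easy to follow
)

# Each element is a (features, labels) pair: features is a dict that maps
# column names to batched tensors, with the label column split off.
first_features, first_labels = next(iter(tiny_dataset))
print(sorted(first_features.keys()))
print(first_labels.numpy())
```

The label column no longer appears among the feature keys, which is why `show_batch` above only iterates over `batch.items()`.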

To restrict the dataset to a subset of the available columns, pass them as `select_columns`. The list must include the label column.

In [0]:
SELECT_COLUMNS = ['survived', 'age', 'n_siblings_spouses', 'class', 'deck', 'alone']

temp_dataset = get_dataset(train_file_path, select_columns=SELECT_COLUMNS)

show_batch(temp_dataset)

## From NumPy

In [0]:
train, test = tf.keras.datasets.fashion_mnist.load_data()
images, labels = train
images = images/255

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
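
`from_tensor_slices` slices its input along the first dimension, so every dataset element is one (image, label) pair. A small sketch with synthetic arrays (the shapes and the names `fake_images` and `fake_labels` are assumptions chosen for illustration, not the Fashion-MNIST data) shows this:

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for image data: 6 "images" of 4x4 pixels plus 6 labels.
fake_images = np.random.rand(6, 4, 4).astype("float32")
fake_labels = np.arange(6, dtype="int64")

fake_dataset = tf.data.Dataset.from_tensor_slices((fake_images, fake_labels))

# The leading dimension (6) is sliced away: each element is a single
# 4x4 image together with a scalar label.
image_spec, label_spec = fake_dataset.element_spec
print(image_spec.shape, label_spec.shape)

num_elements = len(list(fake_dataset))
print(num_elements)
```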

## From Pandas DataFrame

In [0]:
csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/applied-dl/heart.csv')
df = pd.read_csv(csv_file)

df.head()

# df.dtypes shows that the thal column has dtype object (it holds strings).
# Convert it to discrete integer codes so the whole frame is numeric.

df['thal'] = pd.Categorical(df['thal'])
df['thal'] = df.thal.cat.codes

target = df.pop('target')
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))

# shuffle and batch
train_dataset = dataset.shuffle(len(df)).batch(1)
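
To see what the categorical conversion does, here is a small sketch on a toy column. The values in `toy` are hypothetical and only serve to illustrate the mapping; pandas sorts the categories and assigns each one an integer code.

```python
import pandas as pd

# Hypothetical toy column with three distinct string values.
toy = pd.DataFrame({"thal": ["normal", "fixed", "reversible", "normal"]})

categorical = pd.Categorical(toy["thal"])
categories = list(categorical.categories)  # sorted distinct values
codes = categorical.codes.tolist()         # position of each value in `categories`

print(categories)  # ['fixed', 'normal', 'reversible']
print(codes)       # [1, 0, 2, 1]
```

Note that the codes only encode identity, not any meaningful order between the categories.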

# Preprocessing Data

The following section shows:
- Batching
- Repeating for multiple epochs
- Shuffling

For further preprocessing, see [the preprocessing section of the tf.data guide](https://www.tensorflow.org/guide/data#preprocessing_data) and the [data preprocessing section of the CSV tutorial](https://www.tensorflow.org/tutorials/load_data/csv#data_preprocessing).

## Batching

In [0]:
# drop_remainder=True drops the final batch if it has fewer than 7 elements.
batched_dataset = dataset.batch(7, drop_remainder=True)
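
The effect of `drop_remainder` is easiest to see on a dataset whose size is not a multiple of the batch size. The sketch below uses `tf.data.Dataset.range(10)` as a stand-in dataset (an assumption for illustration) and compares the batch sizes:

```python
import tensorflow as tf

numbers = tf.data.Dataset.range(10)  # elements 0..9

# Default: the last, partial batch is kept.
keep_sizes = [int(b.shape[0]) for b in numbers.batch(4)]
# drop_remainder=True: the partial batch is discarded.
drop_sizes = [int(b.shape[0]) for b in numbers.batch(4, drop_remainder=True)]

print(keep_sizes)  # [4, 4, 2]
print(drop_sizes)  # [4, 4]
```

Dropping the remainder gives every batch the same static shape, at the cost of skipping a few examples per epoch.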

## Repeating for Multiple Epochs

Applying the `Dataset.repeat()` transformation with no arguments repeats the input indefinitely.

`Dataset.repeat` concatenates the repetitions without signaling where one epoch ends and the next begins. Because of this, a `Dataset.batch` applied after `Dataset.repeat` yields batches that straddle epoch boundaries:

In [0]:
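# One element per line of the Titanic training CSV downloaded above.
titanic_lines = tf.data.TextLineDataset(train_file_path)
titanic_lines.repeat(3).batch(128) # only the very last batch (end of epoch 3) is smaller
titanic_lines.batch(128).repeat(3) # the last batch of every epoch is smaller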
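
The difference between the two orderings can be checked by looking at the batch sizes. This sketch again uses `tf.data.Dataset.range(10)` with a batch size of 4 as a small stand-in (both are assumptions for illustration):

```python
import tensorflow as tf

numbers = tf.data.Dataset.range(10)  # one "epoch" has 10 elements

# repeat -> batch: 30 elements are batched as one stream, so batches
# straddle epoch boundaries and only the final batch is short.
repeat_then_batch = [int(b.shape[0]) for b in numbers.repeat(3).batch(4)]

# batch -> repeat: each epoch is batched on its own, so every epoch
# ends with a short batch.
batch_then_repeat = [int(b.shape[0]) for b in numbers.batch(4).repeat(3)]

print(repeat_then_batch)  # [4, 4, 4, 4, 4, 4, 4, 2]
print(batch_then_repeat)  # [4, 4, 2, 4, 4, 2, 4, 4, 2]
```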

## Shuffling

The Dataset.shuffle() transformation maintains a fixed-size buffer and chooses the next element uniformly at random from that buffer.

In [0]:
dataset.shuffle(buffer_size=100)
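
The buffer mechanism described above can be sketched in plain Python. This is a simplified model for intuition, not TensorFlow's actual implementation; the function name `buffered_shuffle` and its `seed` parameter are made up for this example.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Simplified model of a fixed-size shuffle buffer (buffer_size >= 1)."""
    rng = random.Random(seed)
    iterator = iter(items)
    buffer = []
    # Fill the buffer with the first `buffer_size` elements.
    for item in iterator:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    output = []
    while buffer:
        # Pick a random slot, emit its element ...
        index = rng.randrange(len(buffer))
        output.append(buffer[index])
        try:
            # ... and refill the slot with the next input element.
            buffer[index] = next(iterator)
        except StopIteration:
            # Input exhausted: shrink the buffer until it is empty.
            buffer.pop(index)
    return output

shuffled = buffered_shuffle(range(20), buffer_size=5)
unshuffled = buffered_shuffle(range(20), buffer_size=1)
print(shuffled)
print(unshuffled)
```

In this model an element can only move forward by fewer than `buffer_size` positions, which is why a buffer at least as large as the dataset is needed for a full uniform shuffle, and why `buffer_size=1` leaves the order unchanged.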

As with `Dataset.batch`, the order relative to `Dataset.repeat` matters.

`Dataset.shuffle` doesn't signal the end of an epoch until the shuffle buffer is empty. So a shuffle placed before a repeat will show every element of one epoch before moving to the next:

In [0]:
dataset.shuffle(buffer_size=100).repeat(2).batch(10)
# not the same as
dataset.repeat(2).shuffle(buffer_size=100).batch(10)
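
A quick way to check the difference is to split the output into epoch-sized chunks. The sketch below uses `tf.data.Dataset.range(10)` and a buffer of 10 as stand-ins (assumptions for illustration); the exact order depends on the random seed, so only the structure is inspected.

```python
import tensorflow as tf

numbers = tf.data.Dataset.range(10)

# shuffle -> repeat: each epoch is shuffled separately, so every chunk
# of 10 outputs contains each element exactly once.
shuffle_then_repeat = [
    int(x.numpy()) for x in numbers.shuffle(buffer_size=10).repeat(2)
]
first_epoch = shuffle_then_repeat[:10]
second_epoch = shuffle_then_repeat[10:]

# repeat -> shuffle: elements from both epochs share one buffer, so the
# first 10 outputs may contain duplicates and miss other elements.
repeat_then_shuffle = [
    int(x.numpy()) for x in numbers.repeat(2).shuffle(buffer_size=10)
]

print(sorted(first_epoch) == list(range(10)))
print(sorted(second_epoch) == list(range(10)))
print(repeat_then_shuffle[:10])
```

Shuffling before repeating keeps epochs cleanly separated; shuffling after repeating blurs epoch boundaries.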

# License

Copyright 2019 The TensorFlow Authors and 2020 Björn Schmidtke for GSERM

In [0]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.