# Data Ingestion

The data ingestion step obtains the data for an ML pipeline. In this step we need to consider the training/test split, the type of data we are ingesting (text, structured, images, etc.), and whether multiple sources have to be combined.

In the ingestion step, before passing the data on to the next step, we need to split it into training and validation sets and then convert those into `TFRecord` files, which hold the data as `tf.Example` data structures.
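As a rough sketch of that split-then-convert step done by hand, the snippet below shuffles a small in-memory dataset, splits it 80/20, and writes each split to its own TFRecord file. The sample values, the `value` feature name, the split ratio, and the file names are all assumptions made for illustration.

```python
import os
import random
import tempfile

import tensorflow as tf

# Hypothetical dataset: ten integer samples.
samples = list(range(10))

# Shuffle with a fixed seed so the split is reproducible.
random.Random(42).shuffle(samples)
split_at = len(samples) * 8 // 10  # 80% training, 20% validation
splits = {"train": samples[:split_at], "validation": samples[split_at:]}

out_dir = tempfile.mkdtemp()
paths = {}
for split_name, split_samples in splits.items():
    paths[split_name] = os.path.join(out_dir, f"{split_name}.tfrecord")
    with tf.io.TFRecordWriter(paths[split_name]) as writer:
        for value in split_samples:
            # Wrap each sample in a tf.train.Example and serialize it to bytes.
            example = tf.train.Example(features=tf.train.Features(feature={
                "value": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[value])),
            }))
            writer.write(example.SerializeToString())

# Count the records in each file to confirm the split sizes.
counts = {
    name: sum(1 for _ in tf.data.TFRecordDataset(path))
    for name, path in paths.items()
}
print(counts)
```

A fixed shuffle seed keeps the two sets stable between runs, which matters when comparing models trained at different times.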

`TFRecord` is a lightweight format optimized for streaming large datasets. In practice, many TensorFlow users store serialized `tf.Example` Protocol Buffers in TFRecord files. The file format supports any binary data, as the example below shows.

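```python
import os

import tensorflow as tf

# Make sure the output directory exists before writing.
os.makedirs("data", exist_ok=True)

# TFRecord files accept arbitrary byte strings as records.
with tf.io.TFRecordWriter("data/test.tfrecord") as w:
    w.write(b"First Record")
    w.write(b"Second Record")

# Read the records back in the order they were written.
for record in tf.data.TFRecordDataset("data/test.tfrecord"):
    print(record)
```

```
tf.Tensor(b'First Record', shape=(), dtype=string)
tf.Tensor(b'Second Record', shape=(), dtype=string)
```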


TFRecord files contain `tf.Example` records (which act as rows, IMO). More details can be found in the [TensorFlow docs](https://www.tensorflow.org/tutorials/load_data/tfrecord).
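To make the row analogy concrete, here is a minimal sketch that builds one `tf.Example` per row, writes the rows to a TFRecord file, and parses them back. The feature names (`name`, `age`), the sample rows, and the helper `make_example` are illustrative assumptions, not part of any real dataset.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical rows; the column names are made up for illustration.
rows = [("alice", 31), ("bob", 45)]


def make_example(name, age):
    """Pack one row into a tf.train.Example (feature name -> list of values)."""
    features = {
        "name": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[name.encode("utf-8")])),
        "age": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[age])),
    }
    return tf.train.Example(features=tf.train.Features(feature=features))


path = os.path.join(tempfile.mkdtemp(), "rows.tfrecord")

# Each record in the file is one serialized Example, i.e. one row.
with tf.io.TFRecordWriter(path) as writer:
    for name, age in rows:
        writer.write(make_example(name, age).SerializeToString())

# The schema tells the parser how to decode each feature.
schema = {
    "name": tf.io.FixedLenFeature([], tf.string),
    "age": tf.io.FixedLenFeature([], tf.int64),
}

parsed_rows = []
for raw in tf.data.TFRecordDataset(path):
    parsed = tf.io.parse_single_example(raw, schema)
    parsed_rows.append(
        (parsed["name"].numpy().decode("utf-8"), int(parsed["age"].numpy())))

print(parsed_rows)
```

The file itself stores only bytes, so the reader has to supply the schema. That is why the feature names and types appear twice, once when writing and once when parsing.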

In general, storing our data as TFRecords of `tf.Example`s brings several benefits: system independence, since the format is implemented with `Protocol Buffers`, a cross-platform, cross-language library for serializing data; optimizations for downloading and writing large amounts of data quickly; and compatibility with the TensorFlow ecosystem as a whole.

> The process of ingesting, splitting, and converting datasets is performed by the `ExampleGen` component of TFX.